Unsupervised Learning: Trade & Ahead¶

Problem Statement¶

Context¶

The stock market has consistently proven to be a good place to invest in and save for the future. There are a lot of compelling reasons to invest in stocks. It can help in fighting inflation, create wealth, and also provides some tax benefits. Good steady returns on investments over a long period of time can also grow a lot more than seems possible. Also, thanks to the power of compound interest, the earlier one starts investing, the larger the corpus one can have for retirement. Overall, investing in stocks can help meet life's financial aspirations.

It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. Having a diversified portfolio tends to yield higher returns and face lower risk by tempering potential losses when the market is down. It is often easy to get lost in a sea of financial metrics to analyze while determining the worth of a stock, and doing the same for a multitude of stocks to identify the right picks for an individual can be a tedious task. By doing a cluster analysis, one can identify stocks that exhibit similar characteristics and ones which exhibit minimum correlation. This will help investors better analyze stocks across different market segments and help protect against risks that could make the portfolio vulnerable to losses.

Objective¶

Trade&Ahead is a financial consultancy firm who provide their customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock price and some financial indicators for a few companies listed under the New York Stock Exchange. They have assigned you the tasks of analyzing the data, grouping the stocks based on the attributes provided, and sharing insights about the characteristics of each group.

Data Dictionary¶

  • Ticker Symbol: An abbreviation used to uniquely identify publicly traded shares of a particular stock on a particular stock market
  • Security: Full name of the ticker symbol's underlying security (human-friendly full name of stock, fund, etc.).
  • Company: Name of the company
  • GICS Sector: The specific economic sector assigned to a company by the Global Industry Classification Standard (GICS) that best defines its business operations
  • GICS Sub Industry: The specific sub-industry group assigned to a company by the Global Industry Classification Standard (GICS) that best defines its business operations
  • Current Price: Current stock price in dollars
  • Price Change: Percentage change in the stock price in 13 weeks
  • Volatility: Standard deviation of the stock price over the past 13 weeks
  • ROE (Return on Equity): A measure of financial performance calculated by dividing net income by shareholders' equity (shareholders' equity is equal to a company's assets minus its debt)
  • Cash Ratio: The ratio of a company's total reserves of cash and cash equivalents to its total current liabilities
  • Net Cash Flow: The difference between a company's cash inflows and outflows (in dollars)
  • Net Income: Revenues minus expenses, interest, and taxes (in dollars)
  • Earnings Per Share: Company's net profit divided by the number of common shares it has outstanding (in dollars)
  • Estimated Shares Outstanding: Company's stock currently held by all its shareholders
  • P/E Ratio: Ratio of the company's current stock price to the earnings per share
  • P/B Ratio: Ratio of the company's stock price per share by its book value per share (book value of a company is the net difference between that company's total assets and total liabilities)

Importing necessary libraries and data¶

In [ ]:
# Installing the libraries with the specified version.
# uncomment and run the following line if Google Colab is being used
#!pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 yellowbrick==1.5 -q --user
In [ ]:
# Import libraries we need to do our work

# Read and manipulate data
import numpy as np
import pandas as pd

# HTML styling
from IPython.display import display, HTML

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Statistics
from scipy.stats import shapiro

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")
In [ ]:
# Installing the libraries with the specified version.
# uncomment and run the following lines if Jupyter Notebook is being used
#!pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.2 yellowbrick==1.5 -q --user
#!pip install --upgrade -q jinja2
In [ ]:
# Load Google Colab
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [ ]:
# Define path to data file
path = r'/content/drive/MyDrive/Learning/Data Coursework/PGP-DSBA/7-Unsupervised Learning/Project 7/'
df = pd.read_csv(path + 'stock_data.csv')

Data Overview¶

  • Observations
  • Sanity checks
In [ ]:
# Iterate through each column and determine its data type

# Calculate number of rows and columns
num_rows, num_cols = df.shape

# Initialize dictionary to store counts of each data type
data_type_counts = {}

# Iterate through each column and determine its data type
for col in df.columns:
    data_type = str(df[col].dtype)
    if data_type in data_type_counts:
        data_type_counts[data_type] += 1
    else:
        data_type_counts[data_type] = 1

# # Check size of dataset and print summary information
print(f'The data has {num_rows} rows and {num_cols} columns total:')
print()
for dtype, count in data_type_counts.items():
    print(f'{count} columns of type {dtype}')
The data has 340 rows and 15 columns total:

4 columns of type object
7 columns of type float64
4 columns of type int64
In [ ]:
# View information
print('Training')
display(df.info())
Training
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Ticker Symbol                 340 non-null    object 
 1   Security                      340 non-null    object 
 2   GICS Sector                   340 non-null    object 
 3   GICS Sub Industry             340 non-null    object 
 4   Current Price                 340 non-null    float64
 5   Price Change                  340 non-null    float64
 6   Volatility                    340 non-null    float64
 7   ROE                           340 non-null    int64  
 8   Cash Ratio                    340 non-null    int64  
 9   Net Cash Flow                 340 non-null    int64  
 10  Net Income                    340 non-null    int64  
 11  Earnings Per Share            340 non-null    float64
 12  Estimated Shares Outstanding  340 non-null    float64
 13  P/E Ratio                     340 non-null    float64
 14  P/B Ratio                     340 non-null    float64
dtypes: float64(7), int64(4), object(4)
memory usage: 40.0+ KB
None
In [ ]:
# Describe the data. Can also use include=all to also show categorical series.
df.describe(include='all').T
Out[ ]:
count unique top freq mean std min 25% 50% 75% max
Ticker Symbol 340 340 AAL 1 NaN NaN NaN NaN NaN NaN NaN
Security 340 340 American Airlines Group 1 NaN NaN NaN NaN NaN NaN NaN
GICS Sector 340 11 Industrials 53 NaN NaN NaN NaN NaN NaN NaN
GICS Sub Industry 340 104 Oil & Gas Exploration & Production 16 NaN NaN NaN NaN NaN NaN NaN
Current Price 340.0 NaN NaN NaN 80.862345 98.055086 4.5 38.555 59.705 92.880001 1274.949951
Price Change 340.0 NaN NaN NaN 4.078194 12.006338 -47.129693 -0.939484 4.819505 10.695493 55.051683
Volatility 340.0 NaN NaN NaN 1.525976 0.591798 0.733163 1.134878 1.385593 1.695549 4.580042
ROE 340.0 NaN NaN NaN 39.597059 96.547538 1.0 9.75 15.0 27.0 917.0
Cash Ratio 340.0 NaN NaN NaN 70.023529 90.421331 0.0 18.0 47.0 99.0 958.0
Net Cash Flow 340.0 NaN NaN NaN 55537620.588235 1946365312.175789 -11208000000.0 -193906500.0 2098000.0 169810750.0 20764000000.0
Net Income 340.0 NaN NaN NaN 1494384602.941176 3940150279.327937 -23528000000.0 352301250.0 707336000.0 1899000000.0 24442000000.0
Earnings Per Share 340.0 NaN NaN NaN 2.776662 6.587779 -61.2 1.5575 2.895 4.62 50.09
Estimated Shares Outstanding 340.0 NaN NaN NaN 577028337.75403 845849595.417695 27672156.86 158848216.1 309675137.8 573117457.325 6159292035.0
P/E Ratio 340.0 NaN NaN NaN 32.612563 44.348731 2.935451 15.044653 20.819876 31.764755 528.039074
P/B Ratio 340.0 NaN NaN NaN -1.718249 13.966912 -76.119077 -4.352056 -1.06717 3.917066 129.064585
In [ ]:
# Check for empty values
print(f'Dataset has {df.isnull().sum().sum()} null values:',)

# Replace any missing values with NaNs
df = df.replace("?", np.nan)
Dataset has 0 null values:
In [ ]:
# Retrieve and preview the categorial column values, using a list.
df_eda = df.copy()

# Create numerical and categorical column lists
list_catcol = df_eda.select_dtypes(include=['object']).columns.to_list()
list_numcol = df_eda.select_dtypes(exclude=['object']).columns.to_list()

print('Dataset has', len(list_catcol), 'categorical columns and', len(list_numcol), 'numerical columns.')
Dataset has 4 categorical columns and 11 numerical columns.
In [ ]:
# Use a for loop to show the counts and percent contribution of each numerical column.
for col in list_numcol:
  print(f'***** {col} *****')
  print('Preview counts for:',df_eda[col].value_counts(),'\n')
  print('Preview % breakdown for:',df_eda[col].value_counts(normalize=True),'\n')
***** Current Price *****
Preview counts for: Current Price
54.360001     2
114.379997    2
43.230000     2
32.930000     2
103.889999    1
             ..
51.070000     1
52.130001     1
81.589996     1
302.399994    1
47.919998     1
Name: count, Length: 336, dtype: int64 

Preview % breakdown for: Current Price
54.360001     0.005882
114.379997    0.005882
43.230000     0.005882
32.930000     0.005882
103.889999    0.002941
                ...   
51.070000     0.002941
52.130001     0.002941
81.589996     0.002941
302.399994    0.002941
47.919998     0.002941
Name: proportion, Length: 336, dtype: float64 

***** Price Change *****
Preview counts for: Price Change
 9.999995     1
 6.237855     1
 6.585554     1
 5.796884     1
 9.529966     1
             ..
 8.037605     1
 10.019650    1
-4.078593     1
 3.654238     1
 16.678836    1
Name: count, Length: 340, dtype: int64 

Preview % breakdown for: Price Change
 9.999995     0.002941
 6.237855     0.002941
 6.585554     0.002941
 5.796884     0.002941
 9.529966     0.002941
                ...   
 8.037605     0.002941
 10.019650    0.002941
-4.078593     0.002941
 3.654238     0.002941
 16.678836    0.002941
Name: proportion, Length: 340, dtype: float64 

***** Volatility *****
Preview counts for: Volatility
1.687151    1
1.023375    1
1.460619    1
1.281566    1
2.168814    1
           ..
1.056186    1
1.308082    1
1.941104    1
1.404508    1
1.610285    1
Name: count, Length: 340, dtype: int64 

Preview % breakdown for: Volatility
1.687151    0.002941
1.023375    0.002941
1.460619    0.002941
1.281566    0.002941
2.168814    0.002941
              ...   
1.056186    0.002941
1.308082    0.002941
1.941104    0.002941
1.404508    0.002941
1.610285    0.002941
Name: proportion, Length: 340, dtype: float64 

***** ROE *****
Preview counts for: ROE
10     24
9      16
14     16
11     16
17     15
       ..
582     1
116     1
68      1
63      1
142     1
Name: count, Length: 82, dtype: int64 

Preview % breakdown for: ROE
10     0.070588
9      0.047059
14     0.047059
11     0.047059
17     0.044118
         ...   
582    0.002941
116    0.002941
68     0.002941
63     0.002941
142    0.002941
Name: proportion, Length: 82, dtype: float64 

***** Cash Ratio *****
Preview counts for: Cash Ratio
99     37
47     19
25     10
70      8
10      7
       ..
82      1
128     1
175     1
164     1
100     1
Name: count, Length: 124, dtype: int64 

Preview % breakdown for: Cash Ratio
99     0.108824
47     0.055882
25     0.029412
70     0.023529
10     0.020588
         ...   
82     0.002941
128    0.002941
175    0.002941
164    0.002941
100    0.002941
Name: proportion, Length: 124, dtype: float64 

***** Net Cash Flow *****
Preview counts for: Net Cash Flow
 159000000    2
 39000000     2
 160383000    2
-8000000      2
 23000000     2
             ..
 4073000      1
-14756000     1
 523803000    1
 2196000      1
-43623000     1
Name: count, Length: 332, dtype: int64 

Preview % breakdown for: Net Cash Flow
 159000000    0.005882
 39000000     0.005882
 160383000    0.005882
-8000000      0.005882
 23000000     0.005882
                ...   
 4073000      0.002941
-14756000     0.002941
 523803000    0.002941
 2196000      0.002941
-43623000     0.002941
Name: proportion, Length: 332, dtype: float64 

***** Net Income *****
Preview counts for: Net Income
1899000000    2
355000000     2
848000000     2
1034000000    2
636056000     2
             ..
268000000     1
232120000     1
878485000     1
85171000      1
339000000     1
Name: count, Length: 335, dtype: int64 

Preview % breakdown for: Net Income
1899000000    0.005882
355000000     0.005882
848000000     0.005882
1034000000    0.005882
636056000     0.005882
                ...   
268000000     0.002941
232120000     0.002941
878485000     0.002941
85171000      0.002941
339000000     0.002941
Name: proportion, Length: 335, dtype: float64 

***** Earnings Per Share *****
Preview counts for: Earnings Per Share
3.28    5
3.71    4
3.03    4
1.80    4
1.58    4
       ..
5.87    1
1.86    1
1.77    1
1.31    1
1.20    1
Name: count, Length: 268, dtype: int64 

Preview % breakdown for: Earnings Per Share
3.28    0.014706
3.71    0.011765
3.03    0.011765
1.80    0.011765
1.58    0.011765
          ...   
5.87    0.002941
1.86    0.002941
1.77    0.002941
1.31    0.002941
1.20    0.002941
Name: proportion, Length: 268, dtype: float64 

***** Estimated Shares Outstanding *****
Preview counts for: Estimated Shares Outstanding
1.992378e+08    4
5.139877e+08    3
1.115226e+08    3
1.469542e+09    2
1.633016e+09    2
               ..
3.671392e+08    1
5.777662e+07    1
5.457799e+08    1
1.485114e+08    1
4.985294e+08    1
Name: count, Length: 323, dtype: int64 

Preview % breakdown for: Estimated Shares Outstanding
1.992378e+08    0.011765
5.139877e+08    0.008824
1.115226e+08    0.008824
1.469542e+09    0.005882
1.633016e+09    0.005882
                  ...   
3.671392e+08    0.002941
5.777662e+07    0.002941
5.457799e+08    0.002941
1.485114e+08    0.002941
4.985294e+08    0.002941
Name: proportion, Length: 323, dtype: float64 

***** P/E Ratio *****
Preview counts for: P/E Ratio
93.089287    17
20.819876     6
24.070121     5
28.407929     3
17.313076     3
             ..
18.636363     1
21.175469     1
55.829113     1
7.575269      1
70.470585     1
Name: count, Length: 295, dtype: int64 

Preview % breakdown for: P/E Ratio
93.089287    0.050000
20.819876    0.017647
24.070121    0.014706
28.407929    0.008824
17.313076    0.008824
               ...   
18.636363    0.002941
21.175469    0.002941
55.829113    0.002941
7.575269     0.002941
70.470585    0.002941
Name: proportion, Length: 295, dtype: float64 

***** P/B Ratio *****
Preview counts for: P/B Ratio
-1.269332     6
-3.089477     5
-3.973395     4
 3.954975     4
-4.327138     4
             ..
-14.151445    1
 5.108756     1
 4.427425     1
 5.884467     1
 1.723068     1
Name: count, Length: 299, dtype: int64 

Preview % breakdown for: P/B Ratio
-1.269332     0.017647
-3.089477     0.014706
-3.973395     0.011765
 3.954975     0.011765
-4.327138     0.011765
                ...   
-14.151445    0.002941
 5.108756     0.002941
 4.427425     0.002941
 5.884467     0.002941
 1.723068     0.002941
Name: proportion, Length: 299, dtype: float64 

In [ ]:
# Use a for loop to show the counts and percent contribution of each categorical column.
for col in list_catcol:
  print(f'***** {col} *****')
  print('Preview counts for:',df_eda[col].value_counts(),'\n')
  print('Preview % breakdown for:',df_eda[col].value_counts(normalize=True),'\n')
***** Ticker Symbol *****
Preview counts for: Ticker Symbol
AAL     1
NEE     1
NUE     1
NTRS    1
NSC     1
       ..
EQR     1
EQIX    1
EOG     1
EMN     1
ZTS     1
Name: count, Length: 340, dtype: int64 

Preview % breakdown for: Ticker Symbol
AAL     0.002941
NEE     0.002941
NUE     0.002941
NTRS    0.002941
NSC     0.002941
          ...   
EQR     0.002941
EQIX    0.002941
EOG     0.002941
EMN     0.002941
ZTS     0.002941
Name: proportion, Length: 340, dtype: float64 

***** Security *****
Preview counts for: Security
American Airlines Group    1
NextEra Energy             1
Nucor Corp.                1
Northern Trust Corp.       1
Norfolk Southern Corp.     1
                          ..
Equity Residential         1
Equinix                    1
EOG Resources              1
Eastman Chemical           1
Zoetis                     1
Name: count, Length: 340, dtype: int64 

Preview % breakdown for: Security
American Airlines Group    0.002941
NextEra Energy             0.002941
Nucor Corp.                0.002941
Northern Trust Corp.       0.002941
Norfolk Southern Corp.     0.002941
                             ...   
Equity Residential         0.002941
Equinix                    0.002941
EOG Resources              0.002941
Eastman Chemical           0.002941
Zoetis                     0.002941
Name: proportion, Length: 340, dtype: float64 

***** GICS Sector *****
Preview counts for: GICS Sector
Industrials                    53
Financials                     49
Health Care                    40
Consumer Discretionary         40
Information Technology         33
Energy                         30
Real Estate                    27
Utilities                      24
Materials                      20
Consumer Staples               19
Telecommunications Services     5
Name: count, dtype: int64 

Preview % breakdown for: GICS Sector
Industrials                    0.155882
Financials                     0.144118
Health Care                    0.117647
Consumer Discretionary         0.117647
Information Technology         0.097059
Energy                         0.088235
Real Estate                    0.079412
Utilities                      0.070588
Materials                      0.058824
Consumer Staples               0.055882
Telecommunications Services    0.014706
Name: proportion, dtype: float64 

***** GICS Sub Industry *****
Preview counts for: GICS Sub Industry
Oil & Gas Exploration & Production            16
REITs                                         14
Industrial Conglomerates                      14
Electric Utilities                            12
Internet Software & Services                  12
                                              ..
Technology Hardware, Storage & Peripherals     1
Real Estate Services                           1
Trucking                                       1
Networking Equipment                           1
Casinos & Gaming                               1
Name: count, Length: 104, dtype: int64 

Preview % breakdown for: GICS Sub Industry
Oil & Gas Exploration & Production            0.047059
REITs                                         0.041176
Industrial Conglomerates                      0.041176
Electric Utilities                            0.035294
Internet Software & Services                  0.035294
                                                ...   
Technology Hardware, Storage & Peripherals    0.002941
Real Estate Services                          0.002941
Trucking                                      0.002941
Networking Equipment                          0.002941
Casinos & Gaming                              0.002941
Name: proportion, Length: 104, dtype: float64 

In [ ]:
# Check counts, unique counts and duplicate values for each column. Based on this, we can decide how to approach each variable.
for col in df_eda:
  unique_count = df_eda[col].nunique()
  duplicate_count = df_eda[col].count()-df_eda[col].nunique()
  empty_count = df_eda[col].isnull().sum().sum()
  print(f'Series {col} has {unique_count} unique values, {duplicate_count} duplicate values, and {empty_count} missing values.')
Series Ticker Symbol has 340 unique values, 0 duplicate values, and 0 missing values.
Series Security has 340 unique values, 0 duplicate values, and 0 missing values.
Series GICS Sector has 11 unique values, 329 duplicate values, and 0 missing values.
Series GICS Sub Industry has 104 unique values, 236 duplicate values, and 0 missing values.
Series Current Price has 336 unique values, 4 duplicate values, and 0 missing values.
Series Price Change has 340 unique values, 0 duplicate values, and 0 missing values.
Series Volatility has 340 unique values, 0 duplicate values, and 0 missing values.
Series ROE has 82 unique values, 258 duplicate values, and 0 missing values.
Series Cash Ratio has 124 unique values, 216 duplicate values, and 0 missing values.
Series Net Cash Flow has 332 unique values, 8 duplicate values, and 0 missing values.
Series Net Income has 335 unique values, 5 duplicate values, and 0 missing values.
Series Earnings Per Share has 268 unique values, 72 duplicate values, and 0 missing values.
Series Estimated Shares Outstanding has 323 unique values, 17 duplicate values, and 0 missing values.
Series P/E Ratio has 295 unique values, 45 duplicate values, and 0 missing values.
Series P/B Ratio has 299 unique values, 41 duplicate values, and 0 missing values.
In [ ]:
# Chart high-level assessment for each numerical series
for col in list_numcol:
  mean = df[col].mean()
  median = df[col].median()
  min = df[col].min()
  max = df[col].max()

  this_range = max-min
  print('*****')
  print('For', col, ':')
  print(' * Mean =', mean)
  print(' * Median =' , median)
  print(' * Range =', min, 'to' , max,'(', round(this_range,3), ')')
*****
For Current Price :
 * Mean = 80.86234477216765
 * Median = 59.705
 * Range = 4.5 to 1274.949951 ( 1270.45 )
*****
For Price Change :
 * Mean = 4.078193812467648
 * Median = 4.819504729
 * Range = -47.12969338 to 55.05168339 ( 102.181 )
*****
For Volatility :
 * Mean = 1.5259762538411765
 * Median = 1.3855929255000001
 * Range = 0.733163184 to 4.58004173 ( 3.847 )
*****
For ROE :
 * Mean = 39.59705882352941
 * Median = 15.0
 * Range = 1 to 917 ( 916 )
*****
For Cash Ratio :
 * Mean = 70.02352941176471
 * Median = 47.0
 * Range = 0 to 958 ( 958 )
*****
For Net Cash Flow :
 * Mean = 55537620.5882353
 * Median = 2098000.0
 * Range = -11208000000 to 20764000000 ( 31972000000 )
*****
For Net Income :
 * Mean = 1494384602.9411764
 * Median = 707336000.0
 * Range = -23528000000 to 24442000000 ( 47970000000 )
*****
For Earnings Per Share :
 * Mean = 2.776661764705882
 * Median = 2.895
 * Range = -61.2 to 50.09 ( 111.29 )
*****
For Estimated Shares Outstanding :
 * Mean = 577028337.7540295
 * Median = 309675137.79999995
 * Range = 27672156.86 to 6159292035.0 ( 6131619878.14 )
*****
For P/E Ratio :
 * Mean = 32.612563318264705
 * Median = 20.81987609
 * Range = 2.935450768 to 528.0390742 ( 525.104 )
*****
For P/B Ratio :
 * Mean = -1.7182493633029414
 * Median = -1.0671703205
 * Range = -76.11907749 to 129.0645854 ( 205.184 )
In [ ]:
# Check for null values
print(f'EDA dataset has {df_eda.isnull().sum().sum()} null values:',)

# Replace any missing values with NaNs
df_eda = df_eda.replace("?", np.nan)

# Count NaN values
nan_count = df_eda.isnull().sum().sum()

if nan_count > 0:
    print(f"EDA DataFrame has {nan_count} NaN values.")
else:
    print("EDA DataFrame does not have NaN values.")
EDA dataset has 0 null values:
EDA DataFrame does not have NaN values.
In [ ]:
# Preview rows with missing data
df_eda[df_eda.isnull().any(axis=1)]
Out[ ]:
Ticker Symbol Security GICS Sector GICS Sub Industry Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
In [ ]:
# Preview a random sample of the data
df_eda.sample(n=10, random_state=1)
Out[ ]:
Ticker Symbol Security GICS Sector GICS Sub Industry Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
102 DVN Devon Energy Corp. Energy Oil & Gas Exploration & Production 32.000000 -15.478079 2.923698 205 70 830000000 -14454000000 -35.55 4.065823e+08 93.089287 1.785616
125 FB Facebook Information Technology Internet Software & Services 104.660004 16.224320 1.320606 8 958 592000000 3669000000 1.31 2.800763e+09 79.893133 5.884467
11 AIV Apartment Investment & Mgmt Real Estate REITs 40.029999 7.578608 1.163334 15 47 21818000 248710000 1.52 1.636250e+08 26.335526 -1.269332
248 PG Procter & Gamble Consumer Staples Personal Products 79.410004 10.660538 0.806056 17 129 160383000 636056000 3.28 4.913916e+08 24.070121 -2.256747
238 OXY Occidental Petroleum Energy Oil & Gas Exploration & Production 67.610001 0.865287 1.589520 32 64 -588000000 -7829000000 -10.23 7.652981e+08 93.089287 3.345102
336 YUM Yum! Brands Inc Consumer Discretionary Restaurants 52.516175 -8.698917 1.478877 142 27 159000000 1293000000 2.97 4.353535e+08 17.682214 -3.838260
112 EQT EQT Corporation Energy Oil & Gas Exploration & Production 52.130001 -21.253771 2.364883 2 201 523803000 85171000 0.56 1.520911e+08 93.089287 9.567952
147 HAL Halliburton Co. Energy Oil & Gas Equipment & Services 34.040001 -5.101751 1.966062 4 189 7786000000 -671000000 -0.79 8.493671e+08 93.089287 17.345857
89 DFS Discover Financial Services Financials Consumer Finance 53.619999 3.653584 1.159897 20 99 2288000000 2297000000 5.14 4.468872e+08 10.431906 -0.375934
173 IVZ Invesco Ltd. Financials Asset Management & Custody Banks 33.480000 7.067477 1.580839 12 67 412000000 968100000 2.26 4.283628e+08 14.814159 4.218620

OBSERVATIONS

  • The raw dataset has 340 rows and 15 columns total:

    • 4 columns of type object

      • Ticker Symbol
      • Security
      • GICS Sector
      • GICS Sub Industry
    • 7 columns of type float64

      • Current Price
      • Price Change
      • Volatility
      • Earnings Per Share
      • Estimated Shares Outstanding
      • P/E Ratio
      • P/B Ratio
    • 4 columns of type int64

      • ROE
      • Cash Ratio
      • Net Cash Flow
      • Net Income
  • Some duplicate data exists. We'll need to examine these to see if we expect duplicates in these features, or if we need to de-dupe the dataset.

  • No values are null or missing.

Exploratory Data Analysis (EDA)¶

As part of our analysis, we will answer the following questions:

  1. What does the distribution of stock prices look like?
  2. The stocks of which economic sector have seen the maximum price increase on average?
  3. How are the different variables correlated with each other?
  4. Cash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents. How does the average cash ratio vary across economic sectors?
  5. P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. How does the P/E ratio vary, on average, across economic sectors?

EDA Helpers¶

The following functions will assist us in our analysis.

In [ ]:
# Import libraries we need
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import MinMaxScaler, StandardScaler
In [ ]:
# The functions below are defined here so we can avoid repeating similar code for commonly used plots.
# These functions are called throughout this notebook.

# This function prints a barplot
def print_barplot(data, column, width, height, titleText, hue=None):
    '''
    Print a single, sorted barplot with title and labels and optional hue.
    '''

    # Set figure size and title
    plt.figure(figsize=(width, height))
    plt.title(titleText)
    plt.xticks(rotation=90)

    # Sort by values
    data_sorted = data[column].value_counts().sort_values(ascending=False).reset_index()
    data_sorted.columns = ['category', 'counts']

    # Plotting
    if hue is None:
        sns.barplot(x='category', y='counts', data=data_sorted)
    else:
        sns.barplot(x='category', y='counts', hue=hue, data=data_sorted)

    plt.xlabel('Categories')
    plt.ylabel('Counts')
    plt.show()
In [ ]:
# This function creates a labeled barplot

def print_labeled_barplot(data, feature, perc=False, n=None):
    '''
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    '''

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
In [ ]:
# Function to plot a boxplot and a histogram on same scale

def histogram_boxplot(data, feature, figsize=(10, 5), kde=False, bins=None):
  '''
  Boxplot and histogram combined

  data: dataframe
  feature: dataframe column
  figsize: size of figure (default (10,5))
  kde: whether to the show density curve (default False)
  bins: number of bins for histogram (default None)
  '''
  # Create a figure with two subplots
  f2, (ax_box2, ax_hist2) = plt.subplots(
      nrows=2,  # Number of rows of the subplot grid
      sharex=True,  # x-axis will be shared among all subplots
      gridspec_kw={'height_ratios': (0.25, 0.75)},  # Ratio of subplot heights
      figsize=figsize,  # Size of the figure
  )

  # Boxplot
  sns.boxplot(
      data=data, x=feature, ax=ax_box2, showmeans=True, color='blue'
  )
  ax_box2.set_title(f'Boxplot of {feature}')  # Title for the boxplot

  # Histogram
  if bins:
      sns.histplot(
          data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette='winter'
      )
  else:
      sns.histplot(
          data=data, x=feature, kde=kde, ax=ax_hist2
      )

  ax_hist2.axvline(
      data[feature].mean(), color='green', linestyle="--", label='Mean'
  )
  ax_hist2.axvline(
      data[feature].median(), color='black', linestyle="-", label='Median'
  )
  ax_hist2.set_title(f'Histogram of {feature}')  # Title for the histogram
  ax_hist2.legend()  # Add legend to the histogram plot

  # Adjust layout
  plt.tight_layout()
  plt.show()
In [ ]:
# Function to calculate Z-score
def calculate_z_score(value, mean, std):
    '''Calculate the Z-score for a single value '''
    if std == 0:
        return 0  # Avoid division by zero
    return (value - mean) / std
In [ ]:
# Function to print the statistics for each feature
def print_numerical_feature_statistics(data, feature):
    '''
    Show the statistics of a feature.

    data: dataframe
    feature: dataframe column
    '''

    # Calculate Shapiro-Wilks to test normal distribution
    stat, p = shapiro(data[feature])

    # Calculate mean and standard deviation
    mean = data[feature].mean()
    std = data[feature].std()

    # Calculate Z-score for the mean
    z_score = calculate_z_score(mean, mean, std)

    new_stats = {
        'Feature': feature,
        'Mean': mean,
        'Median': data[feature].median(),
        'Min': data[feature].min(),
        'Max': data[feature].max(),
        'Range': data[feature].max() - data[feature].min(),
        'Standard Deviation': std,
        'Variance': data[feature].var(),
        'P-value': p
    }

    new_stats_df = pd.DataFrame([new_stats])

    return new_stats_df

Univariate Analysis¶

First, we want to analyze each feature independently.

In [ ]:
# Analyze critical statistics

# Create empty dataframe to store the vital statistics
stat_columns = [
    'Feature',
    'Mean',
    'Median',
    'Min',
    'Max',
    'Range',
    'Standard Deviation',
    'Variance',
    'P-value']

df_stats = pd.DataFrame(columns=stat_columns)
In [ ]:
# Drop ticker symbol and security for this purpose, since they are unique
df_eda.drop(columns=['Ticker Symbol','Security'], axis=1, inplace=True)

# Recreate numerical and categorical column lists
list_catcol = df_eda.select_dtypes(include=['object']).columns.to_list()
list_numcol = df_eda.select_dtypes(exclude=['object']).columns.to_list()
In [ ]:
# Populate stats and show them
for feature in list_numcol:
    new_stats_df = print_numerical_feature_statistics(df_eda.select_dtypes(include=['number']), feature)
    df_stats = pd.concat([df_stats, new_stats_df], ignore_index=True)

# Set the display precision
pd.set_option('display.float_format', '{:.6f}'.format)

# Show statistics
df_stats
Out[ ]:
Feature Mean Median Min Max Range Standard Deviation Variance P-value
0 Current Price 80.862345 59.705000 4.500000 1274.949951 1270.449951 98.055086 9614.799896 0.000000
1 Price Change 4.078194 4.819505 -47.129693 55.051683 102.181377 12.006338 144.152149 0.000000
2 Volatility 1.525976 1.385593 0.733163 4.580042 3.846879 0.591798 0.350225 0.000000
3 ROE 39.597059 15.000000 1.000000 917.000000 916.000000 96.547538 9321.427130 0.000000
4 Cash Ratio 70.023529 47.000000 0.000000 958.000000 958.000000 90.421331 8176.017144 0.000000
5 Net Cash Flow 55537620.588235 2098000.000000 -11208000000.000000 20764000000.000000 31972000000.000000 1946365312.175789 3788337928441157632.000000 0.000000
6 Net Income 1494384602.941176 707336000.000000 -23528000000.000000 24442000000.000000 47970000000.000000 3940150279.327937 15524784223688019968.000000 0.000000
7 Earnings Per Share 2.776662 2.895000 -61.200000 50.090000 111.290000 6.587779 43.398830 0.000000
8 Estimated Shares Outstanding 577028337.754030 309675137.800000 27672156.860000 6159292035.000000 6131619878.140000 845849595.417695 715461538068278272.000000 0.000000
9 P/E Ratio 32.612563 20.819876 2.935451 528.039074 525.103623 44.348731 1966.809957 0.000000
10 P/B Ratio -1.718249 -1.067170 -76.119077 129.064585 205.183663 13.966912 195.074625 0.000000
In [ ]:
# Drop numerical columns for charting object histograms
df_eda_plot_cat = df_eda.copy().drop(columns=list_numcol, axis=1)
In [ ]:
# Show sectors as a visual

# Create counts and percentages
value_counts = df_eda_plot_cat['GICS Sector'].value_counts().reset_index()
value_counts.columns = ['Category', 'Count']
total_counts = value_counts['Count'].sum()
value_counts['Percentage'] = (value_counts['Count'] / total_counts) * 100
value_counts_sorted = value_counts.sort_values(by='Count', ascending=False)

# Plot sector contributions
plt.figure(figsize=(12, 8))
bar_plot = sns.barplot(x='Category', y='Count', data=value_counts_sorted, order=value_counts_sorted['Category'], hue='Category')

plt.xticks(rotation=90)
plt.xlabel('Categories')
plt.ylabel('Counts')
plt.title('Sector Counts')

for p in bar_plot.patches:
    count = p.get_height()
    percentage = value_counts_sorted["Percentage"].iloc[int(p.get_x() + p.get_width() / 2)]
    bar_plot.annotate(f'{count:.1f}\n({percentage:.1f}%)',
                      (p.get_x() + p.get_width() / 2., count),
                      ha='center', va='bottom')
In [ ]:
# Print value contributions for each sector
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

display(value_counts_sorted)

# Since our dataset is small, we'll leave this display option intact for later
#pd.set_option('display.max_rows', 10)
#pd.set_option('display.max_columns', 10)
Category Count Percentage
0 Industrials 53 15.588235
1 Financials 49 14.411765
2 Health Care 40 11.764706
3 Consumer Discretionary 40 11.764706
4 Information Technology 33 9.705882
5 Energy 30 8.823529
6 Real Estate 27 7.941176
7 Utilities 24 7.058824
8 Materials 20 5.882353
9 Consumer Staples 19 5.588235
10 Telecommunications Services 5 1.470588
In [ ]:
# Show sub sector contributions as a visual

# Create counts and percentages
value_counts_sub = df_eda_plot_cat['GICS Sub Industry'].value_counts().reset_index()
value_counts_sub.columns = ['Category', 'Count']
total_counts_sub = value_counts_sub['Count'].sum()
value_counts_sub['Percentage'] = (value_counts_sub['Count'] / total_counts_sub) * 100
value_counts_sorted_sub = value_counts_sub.sort_values(by='Count', ascending=False)

# Plot sector contributions
plt.figure(figsize=(20, 8))
bar_plot = sns.barplot(x='Category', y='Count', data=value_counts_sorted_sub, order=value_counts_sorted_sub['Category'], hue='Category')

plt.xticks(rotation=90)
plt.xlabel('Categories')
plt.ylabel('Counts')
plt.title('Sector Counts')

plt.show()
In [ ]:
# Print value contributions for each sub sector
print('Top 10 Sub Sectors')
display(value_counts_sorted_sub.head(10))
Top 10 Sub Sectors
Category Count Percentage
0 Oil & Gas Exploration & Production 16 4.705882
2 Industrial Conglomerates 14 4.117647
1 REITs 14 4.117647
3 Electric Utilities 12 3.529412
4 Internet Software & Services 12 3.529412
5 Health Care Equipment 11 3.235294
6 MultiUtilities 11 3.235294
7 Banks 10 2.941176
8 Property & Casualty Insurance 8 2.352941
9 Diversified Financial Services 7 2.058824
In [ ]:
print('Bottom Sub Sectors')
display(value_counts_sorted_sub[value_counts_sorted_sub['Percentage'] < .3])
Bottom Sub Sectors
Category Count Percentage
93 Electronic Equipment & Instruments 1 0.294118
85 Financial Exchanges & Data 1 0.294118
86 Water Utilities 1 0.294118
87 Home Entertainment Software 1 0.294118
88 Drug Retail 1 0.294118
89 Electrical Components & Equipment 1 0.294118
90 Semiconductor Equipment 1 0.294118
91 Multi-line Insurance 1 0.294118
92 Copper 1 0.294118
96 Tires & Rubber 1 0.294118
94 Diversified Commercial Services 1 0.294118
95 Consumer Electronics 1 0.294118
83 Agricultural Products 1 0.294118
97 Industrial Materials 1 0.294118
98 Motorcycle Manufacturers 1 0.294118
99 Technology Hardware, Storage & Peripherals 1 0.294118
100 Real Estate Services 1 0.294118
101 Trucking 1 0.294118
102 Networking Equipment 1 0.294118
84 Metal & Glass Containers 1 0.294118
79 Multi-Sector Holdings 1 0.294118
82 Distributors 1 0.294118
70 Human Resource & Employment Services 1 0.294118
62 Gold 1 0.294118
63 Environmental Services 1 0.294118
64 Home Furnishings 1 0.294118
65 Household Appliances 1 0.294118
66 Apparel, Accessories & Luxury Goods 1 0.294118
67 Specialty Retail 1 0.294118
68 Life Sciences Tools & Services 1 0.294118
69 Publishing 1 0.294118
71 Steel 1 0.294118
81 Computer Hardware 1 0.294118
72 Housewares & Specialties 1 0.294118
73 Thrifts & Mortgage Finance 1 0.294118
74 Brewers 1 0.294118
75 Technology, Hardware, Software and Supplies 1 0.294118
76 Personal Products 1 0.294118
77 Industrial Gases 1 0.294118
78 Office REITs 1 0.294118
80 Alternative Carriers 1 0.294118
103 Casinos & Gaming 1 0.294118
In [ ]:
# Drop categorical columns this time, so we can chart histograms and boxplots for the numerical features
df_eda_plot_num = df_eda.copy().drop(columns=list_catcol, axis=1)

# Analyze features
for feature in df_eda_plot_num.columns:
  histogram_boxplot(df_eda_plot_num, feature, figsize=(8, 5), kde=True, bins=None)

OBSERVATIONS

  • The sectors break down as follows, in terms of sub sector industry security counts:

    • Industrials: 53 (15.59%)
    • Financials: 49 (14.41%)
    • Health Care: 40 (11.76%)
    • Consumer Discretionary: 40 (11.76%)
    • Information Technology: 33 (9.71%)
    • Energy: 30 (8.82%)
    • Real Estate: 27 (7.94%)
    • Utilities: 24 (7.06%)
    • Materials: 20 (5.88%)
    • Consumer Staples: 19 (5.59%)
    • Telecommunications Services: 5 (1.5%)
  • Many of the features exhibit skewness, while a few appear normal. Most have outliers. In addition, the ranges of data vary on scales that are quite different. We will likely need to scale our data and calculate z-scores later prior to clustering the data.

    • Right-skewed

      • Current Price
      • Volatility
      • ROE
      • Cash Ratio
      • Estimated Shares Outstanding
      • P/E Ratio
    • Fairly normal

      • Price Change
      • Net Cash Flow
      • Net Income
      • Earnings Per Share
      • P/B Ratio

    For Current Price :

    • Mean = 80.86234477216765
    • Median = 59.705
    • Range = 4.5 to 1274.949951 ( 1270.45 )

    For Price Change :

    • Mean = 4.078193812467648
    • Median = 4.819504729
    • Range = -47.12969338 to 55.05168339 ( 102.181 )

    For Volatility :

    • Mean = 1.5259762538411765
    • Median = 1.3855929255000001
    • Range = 0.733163184 to 4.58004173 ( 3.847 )

    For ROE :

    • Mean = 39.59705882352941
    • Median = 15.0
    • Range = 1 to 917 ( 916 )

    For Cash Ratio :

    • Mean = 70.02352941176471
    • Median = 47.0
    • Range = 0 to 958 ( 958 )

    For Net Cash Flow :

    • Mean = 55537620.5882353
    • Median = 2098000.0
    • Range = -11208000000 to 20764000000 ( 31972000000 )

    For Net Income :

    • Mean = 1494384602.9411764
    • Median = 707336000.0
    • Range = -23528000000 to 24442000000 ( 47970000000 )

    For Earnings Per Share :

    • Mean = 2.776661764705882
    • Median = 2.895
    • Range = -61.2 to 50.09 ( 111.29 )

    For Estimated Shares Outstanding :

    • Mean = 577028337.7540295
    • Median = 309675137.79999995
    • Range = 27672156.86 to 6159292035.0 ( 6131619878.14 )

    For P/E Ratio :

    • Mean = 32.612563318264705
    • Median = 20.81987609
    • Range = 2.935450768 to 528.0390742 ( 525.104 )

    For P/B Ratio :

    • Mean = -1.7182493633029414
    • Median = -1.0671703205
    • Range = -76.11907749 to 129.0645854 ( 205.184 )
  • The five most common GICS Sectors by percentage and counts, out of 340 total:

    • Industrials: 15.6% (53)
    • Financials: 14.4% (49)
    • Health Care: 11.8% (40)
    • Consumer Discretionary: 11.8% (40)
    • Information Technology: 9.7% (33)
  • The five least common GICS Sectors by percentage and counts, out of 340 total:

    • Real Estate: 7.9% (27)
    • Utilities: 7.1% (24)
    • Materials: 5.9% (20)
    • Consumer Staples: 5.6% (19)
    • Telecommunications Services: 1.5% (5)
  • The top 5 sub industry sectors are:

    • Oil & Gas Exploration & Production: 4.7% (16)
    • Industrial Conglomerates: 4.1% (14)
    • REITs: 4.1% (14)
    • Electric Utilities: 3.5% (12)
    • Internet Software & Services: 3.5% (12)
  • We can also see that 42 of the worst-performing sub industry sectors have contributions of less than .3%.

Bivariate Analysis¶

Next, let's identify any relationships between variables.

In order to do this, we need to ensure all variables can be compared. As we can see below, we have several categorical variables.

  • Ticker Symbol: We will drop this unique value
  • Security: We will also drop this unique value
  • GICS Industry Sector: We will convert this to ranked numerical data
  • GICS Sub Industry Sector: We will likely drop this at first, and investigate more deeply across the top or bottom industries, once we can assess relationships of the parent industry sectors.
In [ ]:
df_eda.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   GICS Sector                   340 non-null    object 
 1   GICS Sub Industry             340 non-null    object 
 2   Current Price                 340 non-null    float64
 3   Price Change                  340 non-null    float64
 4   Volatility                    340 non-null    float64
 5   ROE                           340 non-null    int64  
 6   Cash Ratio                    340 non-null    int64  
 7   Net Cash Flow                 340 non-null    int64  
 8   Net Income                    340 non-null    int64  
 9   Earnings Per Share            340 non-null    float64
 10  Estimated Shares Outstanding  340 non-null    float64
 11  P/E Ratio                     340 non-null    float64
 12  P/B Ratio                     340 non-null    float64
dtypes: float64(7), int64(4), object(2)
memory usage: 34.7+ KB
In [ ]:
# Create new rankings for GICS Sector
unique_sectors = df_eda['GICS Sector'].unique()
ranked_categories = sorted(unique_sectors)
rank_mapping = {}

# Add a dictionary for friendly names
friendly_name_mapping = {}

# Populate mapping and rank
for rank, category in enumerate(ranked_categories):
    rank_mapping[category] = rank + 1
    friendly_name_mapping[category] = f"Rank {rank + 1}: {category}"  # Example friendly name

# Map the rankings and create a new column in the DataFrame
df_eda['GICS Sector Int'] = df_eda['GICS Sector'].map(rank_mapping)
In [ ]:
# Drop the non-numerical columns
cols_to_drop = ['GICS Sector', 'GICS Sub Industry']
df_eda_corr = df_eda.copy().drop(columns=cols_to_drop, axis=1)

# Preview new dataframe we will use for correlation heatmap
df_eda_corr
Out[ ]:
Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio GICS Sector Int
0 42.349998 9.999995 1.687151 135 51 -604000000 7610000000 11.390000 668129938.500000 3.718174 -8.784219 6
1 59.240002 8.339433 2.197887 130 77 51000000 5144000000 3.150000 1633015873.000000 18.806350 -8.750068 5
2 44.910000 11.301121 1.273646 21 67 938000000 4423000000 2.940000 1504421769.000000 15.275510 -0.394171 5
3 93.940002 13.977195 1.357679 9 180 -240840000 629551000 1.260000 499643650.800000 74.555557 4.199651 7
4 55.320000 -1.827858 1.701169 14 272 315120000 696878000 0.310000 2247993548.000000 178.451613 1.059810 7
5 36.680000 -12.017268 1.516493 10 49 -189000000 1849000000 2.990000 618394648.800000 12.267559 7.496831 2
6 276.570007 6.189286 1.116976 30 25 90885000 596541000 8.910000 66951851.850000 31.040405 129.064585 7
7 43.230000 2.174424 1.124186 9 14 287000000 636000000 2.600000 244615384.600000 16.626923 -0.719497 11
8 58.270000 2.371753 1.068485 11 9 13900000 2052300000 3.130000 421897810.200000 18.456543 -3.022649 11
9 59.900002 3.027181 1.048295 14 99 -308000000 2533000000 5.880000 430782312.900000 10.187075 -1.883912 4
10 61.970001 8.358108 1.106965 2 99 -129000000 2196000000 1.690000 1299408284.000000 36.668640 -4.327138 4
11 40.029999 7.578608 1.163334 15 47 21818000 248710000 1.520000 163625000.000000 26.335526 -1.269332 9
12 80.540001 1.897773 1.112604 3 99 -30351000 141555000 2.080000 68055288.460000 38.721154 -4.072615 4
13 40.939999 -0.606943 1.052050 10 99 166000000 356800000 2.070000 172367149.800000 19.777777 -9.855706 4
14 52.630001 -23.790903 1.384502 10 225 50823000 321406000 1.800000 178558888.900000 29.238889 4.282358 7
15 56.009998 26.461948 1.974323 10 13 -2276034000 334906000 3.010000 111264451.800000 18.607973 -13.649723 8
16 80.510002 2.066436 1.773431 35 74 -34000000 848000000 6.610000 128290469.000000 12.180031 -1.114658 6
17 62.090000 6.592275 1.053266 11 99 -162000000 2171000000 5.120000 424023437.500000 12.126953 -4.327138 4
18 65.919998 13.753230 1.283795 601 45 -90800000 153900000 1.600000 96187500.000000 41.199999 -0.877453 6
19 190.750000 22.338380 2.022921 2 195 66000000 144000000 0.680000 211764705.900000 280.514706 -14.171389 5
20 18.670000 26.834239 1.460030 18 131 1795000000 1377000000 1.130000 1218584071.000000 16.522124 3.857756 7
21 53.590000 2.212474 1.089266 18 37 3390000 590859000 2.460000 240186585.400000 21.784553 -4.490342 6
22 159.759995 -6.611335 2.093065 18 67 13200000 516000000 9.490000 54373024.240000 16.834562 -31.015380 4
23 162.330002 17.163478 1.630259 25 362 413000000 6939000000 9.150000 758360655.700000 17.740984 24.012322 5
24 106.419998 -2.420686 1.222260 22 67 -281000000 1562000000 8.600000 181627907.000000 12.374418 -13.398380 4
25 96.949997 10.233087 1.165804 10 39 7194000 685074000 1.420000 482446478.900000 68.274646 -20.813577 9
26 675.890015 32.268105 1.460386 4 58 1333000000 596000000 1.280000 465625000.000000 528.039074 3.904430 1
27 59.660000 2.350316 1.480914 19 1 -1300000 442600000 3.930000 112620865.100000 15.180662 -7.970104 1
28 139.440002 -0.620053 1.511654 11 70 -38200000 2560000000 9.730000 263103802.700000 14.330935 -31.006773 5
29 92.209999 3.910301 1.105032 23 99 10000000 1385000000 4.930000 280933062.900000 18.703854 -7.759856 4
30 44.470001 11.397804 2.405408 917 80 698000000 -23528000000 -61.200000 384444444.400000 93.089287 4.970809 3
31 48.580002 -20.802083 2.435165 52 22 -6430000000 -6692000000 -13.180000 507738998.500000 93.089287 -12.860938 3
32 52.230000 2.693667 1.007762 24 175 768300000 763500000 2.470000 309109311.700000 21.145749 8.202923 7
33 7.398807 1.647784 2.592065 3 37 42000000 -322000000 -0.310000 1038709677.000000 18.687607 2.639814 6
34 38.709999 23.319529 1.886335 11 70 -3025000000 892000000 1.210000 737190082.600000 31.991735 0.290291 7
35 184.130005 4.857630 1.132875 8 47 -108953000 741733000 5.540000 133886823.100000 33.236463 -3.089477 9
36 145.149994 17.902683 1.847180 29 163 218000000 1364000000 2.860000 1044055944.000000 18.249407 3.954975 7
37 59.750000 8.596874 1.171525 9 4 22000000 476000000 2.660000 178947368.400000 22.462406 -4.895294 11
38 69.550003 -6.216290 0.900066 25 99 474000000 5163000000 3.900000 506660363.000000 10.263506 -0.609074 4
39 144.589996 10.105078 1.155905 82 24 -431000000 5176000000 7.520000 688297872.300000 19.227393 22.032612 6
40 16.830000 8.440722 1.418688 6 99 20764000000 15888000000 4.180000 845069512.200000 13.004785 -0.938007 4
41 38.150002 16.702365 1.204526 11 128 -712000000 968000000 1.780000 543820224.700000 21.432585 8.637045 5
42 37.810001 5.940045 1.077678 8 99 1386000000 2084000000 2.590000 804633204.600000 14.598456 -0.852562 4
43 189.440002 1.549182 1.394436 9 82 -9600000 135400000 1.800000 75222222.220000 105.244446 -4.127770 5
44 46.150002 -12.312367 2.559553 12 84 584000000 -1967000000 -4.490000 438084632.500000 93.089287 13.490544 3
45 306.350006 4.917982 1.825994 38 133 148900000 3547000000 15.380000 230624187.300000 19.918726 0.162602 5
46 41.220001 5.422003 1.201660 8 99 -433000000 3158000000 2.730000 1156776557.000000 15.098901 -3.321298 4
47 72.730003 16.535816 1.386684 22 10 32600000 280900000 2.050000 137024390.200000 35.478050 -3.895657 8
48 68.790001 16.081680 1.498872 11 53 -3186000000 1565000000 0.940000 1664893617.000000 73.180852 0.588026 5
49 18.440001 11.757582 1.491764 4 13 -268000000 -239000000 -0.180000 1327777778.000000 31.468962 -3.880921 5
50 43.230000 3.470563 2.058769 17 25 -220100000 609700000 2.720000 224154411.800000 15.893382 1.044815 1
51 127.540001 7.203497 1.089469 10 47 -1039361000 583106000 3.790000 153853825.900000 33.651715 -1.269332 9
52 51.750000 4.714695 1.261984 8 99 -11208000000 17242000000 5.410000 3187060998.000000 9.565619 -1.167533 4
53 67.959999 3.550209 1.493553 17 25 -881000000 2512000000 3.540000 709604519.800000 19.197740 6.264053 6
54 116.849998 13.193834 0.944847 10 99 1120000000 2834000000 8.710000 325373134.300000 13.415614 -17.466101 4
55 34.580002 8.197757 1.297857 20 12 -200481000 547132000 1.640000 333617073.200000 21.085367 -3.415302 9
56 86.449997 9.569068 0.960191 21 36 3190000 1520992000 4.440000 342565765.800000 19.470720 -10.666679 9
57 54.480000 8.938217 1.347239 7 20 1064000000 1757000000 2.260000 777433628.300000 24.106195 -7.477166 1
58 119.760002 8.448793 2.000828 27 333 758700000 1602000000 2.020000 793069306.900000 59.287130 -4.320051 5
59 40.810001 -9.250611 2.368186 16 25 -1710600000 664900000 2.970000 223872053.900000 13.740741 -0.393528 8
60 26.190001 10.273688 1.189236 4 99 -191000000 840000000 1.550000 541935483.900000 16.896775 -0.063096 4
61 42.439999 1.047615 0.929026 20 38 -93000000 410400000 3.130000 131118210.900000 13.559105 -9.428134 2
62 4.500000 -38.101788 4.559815 687 22 -3283000000 -14685000000 -22.430000 654703522.100000 28.407929 -1.840528 3
63 62.020000 -9.008221 1.185473 44 12 39289000 509699000 3.520000 144800852.300000 17.619318 1.117804 6
64 183.100006 3.598507 1.697942 589 0 2000000 -271000000 -2.430000 111522633.700000 20.819876 -76.119077 1
65 146.330002 8.682415 1.588398 17 70 548000000 2094000000 8.170000 256303549.600000 17.910649 -8.805281 5
66 59.169998 9.777358 0.935812 10 99 -47000000 634000000 3.870000 163824289.400000 15.289405 -4.327138 4
67 66.620003 4.781379 0.895471 463 27 -119000000 1384000000 1.530000 904575163.400000 43.542486 -0.548324 2
68 41.830002 1.900129 1.557655 7 99 76000000 521000000 2.930000 177815699.700000 14.276451 -0.063096 4
69 90.599998 -2.402245 1.323348 6 99 326500000 1247000000 3.710000 336118598.400000 24.420485 -58.649536 4
70 479.850006 -33.131268 2.474002 22 237 -171460000 475602000 15.300000 31085098.040000 31.362745 17.201329 1
71 88.010002 -18.884791 1.472364 19 48 -590000000 1399000000 7.860000 177989821.900000 11.197201 21.439428 6
72 36.080002 1.949712 1.037844 14 12 39000000 537000000 1.900000 282631578.900000 18.989475 0.063687 11
73 65.809998 21.712591 2.298696 16 70 150000000 355000000 2.990000 118729097.000000 22.010033 -1.305493 5
74 18.360001 1.436470 1.389867 20 43 -34000000 -692000000 -1.610000 429813664.600000 17.313076 0.516503 11
75 72.180000 -0.619575 1.364592 9 99 781000000 4050000000 7.150000 566433566.400000 10.095105 -0.725643 4
76 17.690001 -20.099360 3.055818 6 0 -20440000 -113891000 -0.280000 406753571.400000 93.089287 -0.223147 3
77 134.199997 -9.672215 1.555057 8 3 -8796000 203523000 1.790000 140335195.500000 33.994412 0.855096 5
78 25.950001 -4.349421 1.626219 17 74 -41000000 1968000000 2.000000 984000000.000000 12.975001 0.902439 6
79 25.160000 0.159232 1.522194 6 3 -2000000 878000000 1.580000 555696202.500000 15.924051 -13.383212 10
80 60.020000 -4.654489 1.338123 17 182 115100000 1623600000 2.670000 608089887.600000 22.479401 7.121644 7
81 75.650002 9.021477 1.968864 16 52 108369000 319361000 2.010000 158886069.700000 37.636817 -1.765013 7
82 97.769997 1.326561 1.487367 14 11 -22000000 5237000000 4.660000 1123819742.000000 20.980686 -7.019809 2
83 89.959999 12.844955 1.750655 3 43 -1763000000 4587000000 2.460000 1864634146.000000 36.569105 4.763937 3
84 92.860001 -2.744032 2.692546 1 38 228529000 65900000 0.540000 122037037.000000 171.962965 5.673991 3
85 67.639999 -3.988642 0.889931 15 8 289000000 1899000000 3.210000 591588785.000000 21.071651 -7.604945 11
86 50.689999 13.375084 1.444219 42 31 -116000000 4526000000 5.680000 796830985.900000 8.924296 -16.730022 6
87 66.599998 37.489677 1.577881 20 60 -1610000000 1953000000 2.170000 900000000.000000 30.691243 3.252222 8
88 76.269997 3.952561 1.551946 29 22 375200000 1940000000 4.030000 532235888.200000 14.842233 6.277287 6
89 53.619999 3.653584 1.159897 20 99 2288000000 2297000000 5.140000 446887159.500000 10.431906 -0.375934 4
90 71.139999 15.674795 1.381490 15 11 -59000000 709000000 4.920000 144105691.100000 14.459349 -4.552214 5
91 70.416985 8.924595 1.191466 14 13 -2214800000 3357400000 4.810000 698004158.000000 14.639706 -13.759230 6
92 105.080002 2.049141 1.188454 19 26 848000000 8382000000 4.950000 1693333333.000000 21.228283 -3.985039 1
93 26.680000 2.026769 1.689235 19 25 23000000 1034000000 -2.430000 111522633.700000 20.819876 -76.119077 1
94 25.219999 3.572891 1.812144 19 25 23000000 1034000000 -2.430000 111522633.700000 20.819876 -76.119077 1
95 85.730003 12.109326 1.440884 64 14 -325000000 1450000000 5.080000 285433070.900000 16.875985 -0.662152 1
96 75.620003 15.573900 1.070406 7 39 22239000 296689000 1.560000 190185256.400000 48.474361 -7.489313 9
97 103.930000 -1.188439 1.337924 15 38 46300000 168800000 4.680000 36068376.070000 22.207265 -11.877441 6
98 61.310001 6.979585 1.507569 24 26 -319396000 869829000 5.520000 157577717.400000 11.106884 -2.293440 6
99 93.199997 18.049399 1.150797 35 58 683000000 764000000 4.000000 191000000.000000 23.299999 -12.717277 2
100 71.389999 -0.833448 1.096727 7 8 -1179000000 2816000000 4.050000 695308642.000000 17.627160 -4.426811 11
101 69.709999 -3.622291 1.211643 6 79 533875000 269732000 1.270000 212387401.600000 54.889763 1.962527 5
102 32.000000 -15.478079 2.923698 205 70 830000000 -14454000000 -35.550000 406582278.500000 93.089287 1.785616 3
103 27.480000 12.163265 1.409302 26 271 -4496000000 1725000000 1.430000 1206293706.000000 19.216783 4.601699 7
104 114.379997 3.783684 1.078516 15 2 -116800000 1002100000 3.380000 296479289.900000 33.840236 -14.928867 8
105 64.269997 -3.974306 1.068002 9 20 249000000 1193000000 4.070000 293120393.100000 15.791154 -3.022649 11
106 111.370003 14.531063 1.081040 19 15 -35000000 429100000 3.610000 118864265.900000 30.850416 -8.116821 6
107 59.209999 -6.135071 0.927260 10 3 29000000 1117000000 3.130000 356869009.600000 18.916933 -6.369284 11
108 67.510002 3.654238 1.404508 22 14 79000000 848000000 5.710000 148511383.500000 11.823118 -12.308821 8
109 70.790001 -4.078593 1.941104 35 39 -1368707000 -4524515000 -8.290000 545779855.200000 93.089287 1.416243 3
110 302.399994 10.019650 1.308082 7 164 1617921000 187774000 3.250000 57776615.380000 93.046152 23.856728 9
111 81.589996 8.037605 1.056186 8 47 2196000 870120000 2.370000 367139240.500000 34.426159 -1.269332 9
112 52.130001 -21.253771 2.364883 2 201 523803000 85171000 0.560000 152091071.400000 93.089287 9.567952 3
113 51.070000 0.709921 1.232829 8 1 -14756000 878485000 2.770000 317142599.300000 18.436823 -1.169833 11
114 239.410004 6.765073 1.118425 4 47 4073000 232120000 3.500000 66320000.000000 68.402858 -3.089477 9
115 29.639999 12.656785 1.452048 5 99 450000000 268000000 0.920000 291304347.800000 32.217390 -0.597313 4
116 52.040001 1.166411 1.521430 13 10 -513000000 1979000000 4.250000 465647058.800000 12.244706 -8.639591 6
117 68.360001 4.910983 1.217401 2 44 -71065000 -156734000 -0.990000 158317171.700000 18.456543 6.174024 11
118 78.980003 11.616734 1.666482 20 257 64600000 494900000 2.300000 215173913.000000 34.339132 6.349747 5
119 27.770000 -6.403775 1.351595 9 74 4624000000 2269000000 2.685000 299887089.100000 17.313076 -1.715880 11
120 45.099998 -4.449159 1.062553 27 94 -119311000 457223000 2.420000 188935124.000000 18.636363 5.991459 6
121 124.300003 4.894517 1.578747 16 29 273599000 764465000 5.870000 130232538.300000 21.175469 -44.103494 1
122 88.209999 13.922251 1.186059 19 39 28136000 394950000 1.580000 249968354.400000 55.829113 -14.151445 9
123 14.090000 2.398256 1.151454 26 43 3515000000 7373000000 1.860000 3963978495.000000 7.575269 5.108756 1
124 40.820000 10.984234 1.411396 29 35 14523000 516361000 1.770000 291729378.500000 23.062147 4.427425 6
125 104.660004 16.224320 1.320606 8 958 592000000 3669000000 1.310000 2800763359.000000 79.893133 5.884467 7
126 55.500000 16.817517 1.348297 13 31 46600000 315000000 1.970000 159898477.200000 28.172589 -2.100708 6
127 6.770000 -31.685167 3.796410 155 5 -240000000 -12156000000 -11.310000 1074801061.000000 22.811951 2.935427 8
128 31.730000 1.179844 1.238785 5 2 46000000 578000000 1.370000 421897810.200000 23.160584 -6.072561 11
129 60.599998 -10.553508 1.148295 7 29 194800000 650800000 2.220000 293153153.200000 27.297296 -19.086610 7
130 91.459999 5.235295 0.904487 27 18 -19000000 712000000 3.040000 234210526.300000 30.085526 -7.975730 7
131 28.070000 0.214209 1.761193 15 81 -58589000 241686000 1.730000 139702890.200000 16.225434 4.014713 7
132 47.220001 10.819056 1.774454 14 73 -43239000 412512000 2.850000 144741052.600000 16.568421 14.992623 6
133 42.080002 2.210357 1.781661 16 27 -83906000 267669000 2.010000 133168656.700000 20.935324 6.746760 6
134 39.130001 15.088238 2.175738 26 5 -30900000 489000000 3.660000 133606557.400000 10.691257 5.101546 8
135 146.100006 6.806063 1.239858 13 47 -26905000 210219000 3.040000 69150986.840000 48.059213 -3.973395 9
136 65.989998 55.051683 2.075216 10 190 -355228000 546421000 5.420000 100815682.700000 12.175276 22.563791 7
137 4.670000 -2.301255 2.026818 3 496 254000000 -196000000 -0.290000 675862069.000000 14.518987 10.497704 10
138 137.360001 -0.463767 0.939544 28 22 -1603000000 2965000000 9.230000 321235102.900000 14.881907 4.242998 6
139 27.209999 4.212937 1.390342 17 47 -15576000 1374561000 3.040000 158299350.600000 47.366973 -3.973395 9
140 101.190002 2.689264 1.494060 98 148 2824000000 18108000000 12.370000 1463864188.000000 8.180275 3.159446 5
141 18.280001 6.588927 1.578483 7 163 -809000000 1339000000 1.020000 1312745098.000000 17.921570 3.617610 7
142 34.009998 12.281271 1.344514 24 33 -3857000000 9687000000 6.110000 1585433715.000000 5.566284 -4.892037 1
143 85.889999 4.033432 1.177027 22 5 73901000 705672000 4.650000 151757419.400000 18.470968 7.205242 1
144 37.169998 3.393594 1.665475 14 121 -363198000 456227000 2.390000 190889958.200000 15.552300 5.760057 1
145 32.669998 10.446241 1.522778 8 30 -685000000 307000000 1.140000 269298245.600000 28.657893 3.835896 1
146 202.589996 -5.336199 1.348597 34 16 63492000 768996000 11.690000 65782378.100000 17.330196 12.112879 6
147 34.040001 -5.101751 1.966062 4 189 7786000000 -671000000 -0.790000 849367088.600000 93.089287 17.345857 3
148 67.360001 -7.076834 1.583355 27 92 83583000 451838000 3.610000 125162880.900000 18.659280 12.145326 1
149 11.060000 4.143126 1.337793 11 99 -373409000 692957000 0.820000 845069512.200000 13.487805 -0.065057 4
150 67.629997 -12.532337 1.914907 28 13 175000000 2129000000 5.140000 414202334.600000 13.157587 -7.279051 5
151 68.029999 0.044116 1.341731 6 47 -112818000 849073000 2.350000 361307659.600000 28.948936 -3.738047 9
152 34.826959 2.218658 1.282286 6 47 162690000 -559235000 -1.210000 462177686.000000 33.651715 -1.269332 9
153 48.480000 -4.585713 2.398580 16 103 272000000 -3056000000 -10.780000 283487940.600000 28.407929 6.264817 3
154 43.459999 -5.005467 1.147332 10 99 49000000 1682000000 4.050000 415308642.000000 10.730864 -4.327138 4
155 45.389999 -17.247036 1.560372 41 31 -184471000 752207000 3.710000 202751212.900000 12.234501 6.069389 1
156 103.570000 9.320247 1.103449 26 41 -1504000000 4768000000 6.110000 780360065.500000 16.950900 -3.709826 6
157 15.200000 -17.837838 3.400491 7 45 7523000000 2461000000 1.800000 513987730.100000 25.309524 3.954975 7
158 11.840000 2.161759 2.373359 16 18 2300000000 4554000000 1.800000 513987730.100000 25.309524 3.954975 7
159 39.540001 24.496225 1.078455 17 29 13065000 686088000 3.280000 199237804.900000 24.070121 -1.980483 2
160 158.190002 18.317132 1.013922 17 3 -17388000 479058000 5.780000 82882006.920000 27.368512 5.925677 5
161 15.340000 -3.217666 1.594628 8 47 -445000000 558000000 0.220000 2536363636.000000 69.727273 -0.113548 9
162 89.269997 -3.261814 1.188383 51 16 -28325000 512951000 3.280000 199237804.900000 24.070121 -1.980483 2
163 178.509995 -0.145443 1.615206 12 70 636000000 1276000000 8.540000 149414519.900000 20.902810 -8.805281 5
164 137.619995 -5.292136 1.082881 92 24 -790000000 13190000000 13.480000 978486646.900000 10.209198 4.852391 7
165 72.919998 -1.565880 1.469586 228 40 -193542000 192078000 2.070000 92791304.350000 35.227052 -0.981083 5
166 119.639999 14.961083 1.152855 26 25 -296585000 419247000 5.190000 80779768.790000 23.052023 5.047700 8
167 34.450001 14.035095 1.226022 19 162 12747000000 11420000000 2.410000 4738589212.000000 14.294606 3.954975 7
168 37.700001 -0.026513 1.301630 24 27 -831000000 938000000 2.250000 416888888.900000 16.755556 6.123934 8
169 23.280001 21.821035 1.139799 23 20 -157700000 454600000 1.110000 409549549.600000 20.972974 0.265658 1
170 27.010000 -13.067267 1.301382 24 15 2448000 123241000 0.580000 212484482.800000 46.568966 -2.763651 9
171 546.159973 18.733013 1.126009 14 317 114300000 588800000 15.870000 37101449.280000 34.414617 42.607500 5
172 92.680000 12.776831 1.142869 36 130 -900000000 1899000000 5.160000 368023255.800000 17.961240 7.586477 6
173 33.480000 7.067477 1.580839 12 67 412000000 968100000 2.260000 428362831.900000 14.814159 4.218620 4
174 73.360001 2.961405 1.218373 33 1 -395000 427235000 3.690000 115781842.800000 19.880759 2.823845 6
175 41.950001 11.539484 1.732990 7 23 -271788000 302971000 2.420000 125194628.100000 17.334711 6.294943 6
176 27.600000 7.351229 1.841767 14 108 -218700000 633700000 1.620000 391172839.500000 17.037037 2.752236 7
177 66.029999 8.033377 1.130337 10 99 -7341000000 24442000000 6.050000 4040000000.000000 10.914049 -1.886881 4
178 26.459999 8.709938 1.224688 18 47 2212000 894115000 2.010000 444833333.300000 13.164179 -1.269332 9
179 127.300003 17.511309 0.870405 582 10 -170000000 1013000000 2.780000 364388489.200000 45.791368 -2.533011 2
180 14.920000 -47.129693 3.139352 1 7 -86000000 253000000 0.100000 2530000000.000000 149.200000 -1.894071 3
181 42.959999 6.812526 0.889913 29 74 -1649000000 7351000000 1.690000 4349704142.000000 25.420118 -1.466307 2
182 74.669998 -18.438017 2.071639 12 18 -211400000 483500000 4.410000 109637188.200000 16.931972 -2.012091 6
183 42.020000 1.965545 1.204037 30 36 -79600000 329200000 2.310000 142510822.500000 18.190476 2.931005 6
184 48.910000 1.705136 1.569167 14 25 -123369000 802894000 3.870000 207466149.900000 12.638243 -0.307832 1
185 123.639999 14.174899 1.603130 9 42 136400000 436900000 5.030000 178246546.300000 15.900937 -1.294844 5
186 29.629999 4.441304 1.427237 14 12 -27208000 423223000 3.030000 213598256.500000 20.819876 -0.857290 1
187 119.510002 14.539013 1.513434 6 7 -235000000 -240000000 -2.970000 80808080.810000 17.334711 14.280750 6
188 84.260002 0.789478 1.440622 17 54 -205200000 2408400000 2.270000 1060969163.000000 37.118944 -0.651103 5
189 217.149994 5.254227 0.903098 116 8 -356000000 3605000000 11.620000 310240963.900000 18.687607 -10.852854 6
190 31.225001 6.642759 1.115842 10 0 -51100000 388400000 3.360000 115595238.100000 9.293155 -4.606591 11
191 17.389999 -14.292764 1.554235 2 81 -638127000 252111000 0.740000 340690540.500000 23.499999 19.821416 4
192 43.060001 13.855106 1.536290 30 41 301000000 2181000000 3.300000 660909090.900000 13.048485 -5.117194 6
193 54.360001 24.707504 1.457013 34 61 274000000 3433000000 9.710000 353553038.100000 5.598352 -2.245773 10
194 86.900002 2.573183 1.609745 68 46 -107000000 4476000000 9.620000 465280665.300000 9.033264 10.316354 8
195 97.360001 7.496965 1.095876 63 130 610000000 3808000000 3.360000 1133333333.000000 28.976191 3.451765 7
196 90.809998 10.622490 1.177776 12 47 10906000 350745000 4.410000 79534013.610000 20.591836 -3.089477 9
197 80.690002 4.183351 1.169328 10 47 1603000 487562000 3.080000 158299350.600000 26.198053 -3.973395 9
198 67.040001 -1.973970 1.642450 24 3 -8000000 859000000 3.220000 266770186.300000 20.819876 -12.370198 1
199 28.299999 11.637077 1.428359 263 61 85000000 355000000 1.030000 344660194.200000 27.475727 2.219577 6
200 27.170000 30.625006 1.921708 14 54 -78836000 369416000 1.080000 342051851.900000 25.157407 4.535251 1
201 118.139999 19.939085 0.733163 64 260 5607600000 4529300000 4.820000 939688796.700000 24.510373 7.122145 1
202 100.339996 2.345976 1.268800 167 183 537900000 941300000 4.700000 200276595.700000 21.348935 8.615585 4
203 44.840000 6.079962 1.321548 26 17 239000000 7267000000 4.490000 1618485523.000000 9.986637 -12.809506 2
204 48.209999 1.366690 1.138650 8 99 1944000000 5310000000 4.610000 1151843818.000000 10.457700 -1.883912 4
205 189.389999 3.514425 1.492478 13 3 -16185000 615302000 2.590000 237568339.800000 73.123552 -3.980316 1
206 78.949997 12.081196 1.718403 103 136 403700000 653500000 3.280000 199237804.900000 24.070121 6.495755 2
207 85.559998 6.976738 1.032221 24 9 35300000 401600000 3.140000 127898089.200000 27.248407 -1.980483 2
208 136.580002 -10.866015 2.164150 7 46 59758000 288792000 4.310000 67005104.410000 31.689096 3.050887 8
209 55.450001 6.022948 1.034162 25 99 -584000000 1599000000 3.010000 531229235.900000 18.421927 -1.950194 4
210 150.639999 5.927847 0.982698 42 27 -99000000 4833000000 7.720000 626036269.400000 19.512953 2.023844 6
211 49.653332 10.800357 1.585944 11 568 1805094000 546733000 3.710000 1469541779.000000 25.420118 -5.190734 2
212 58.209999 6.885788 0.959008 182 33 -952000000 5241000000 2.670000 1962921348.000000 21.801498 -6.632971 2
213 27.590000 -11.229086 2.830675 10 62 -1098300000 1000400000 2.790000 358566308.200000 9.888889 5.846617 8
214 51.840000 11.507846 1.989371 22 18 -367000000 2852000000 5.290000 539130434.800000 9.799622 5.798226 3
215 52.820000 7.031413 1.278460 10 70 1083000000 4442000000 1.580000 2811392405.000000 33.430380 -4.282931 5
216 12.590000 -20.265991 3.325386 12 71 -1177000000 -2204000000 -3.260000 676073619.600000 93.089287 1.273530 3
217 121.180000 -0.361785 1.380390 7 99 -5317000 1079667000 7.220000 149538365.700000 16.783934 -0.938007 4
218 339.130005 18.942905 1.115376 61 18 13624000 352820000 12.750000 27672156.860000 26.598432 4.403994 5
219 22.450001 -8.591197 2.851180 43 27 -910125000 -2270833000 -13.030000 174277283.200000 28.407929 -1.298006 3
220 54.070000 33.177346 2.299304 9 30 1010500000 847600000 1.580000 1633015873.000000 33.430380 -4.213309 5
221 11.450000 1.868327 2.230827 25 99 151000000 997000000 2.660000 374812030.100000 4.304511 -1.880943 4
222 32.930000 7.298791 2.509437 24 57 -155000000 -2441000000 -6.070000 402141680.400000 93.089287 1.171229 3
223 58.169998 8.810324 1.563258 8 117 -126000000 428000000 2.560000 167187500.000000 22.722655 -11.717383 4
224 103.889999 6.237855 1.023375 12 6 -6000000 2752000000 6.110000 450409165.300000 17.003273 -7.353314 11
225 17.990000 10.844116 2.536050 2 198 379000000 220000000 0.430000 511627907.000000 41.837209 6.971864 8
226 114.379997 11.145654 2.605949 6 65 695722000 122641000 0.290000 422900000.000000 394.413783 -5.700168 7
227 32.560001 -3.296695 2.421529 244 1 -9000000 -3362000000 -21.180000 158734655.300000 93.089287 -0.138596 3
228 46.599998 4.931317 1.198493 13 21 84000000 570000000 1.550000 367741935.500000 30.064515 -12.375526 6
229 33.490002 -12.558739 1.952020 5 49 -1456000000 -769000000 -1.990000 386432160.800000 93.089287 9.582536 3
230 84.589996 9.529966 2.168814 13 49 128000000 1556000000 5.130000 303313840.200000 16.489278 0.926433 6
231 72.089996 5.796884 1.281566 11 67 3394000000 973800000 4.030000 241637717.100000 17.888336 -13.398380 4
232 40.299999 6.585554 1.460619 5 147 915325000 357659000 1.110000 322215315.300000 36.306305 11.168107 8
233 44.080002 9.980039 1.641300 19 14 75400000 350000000 1.300000 269230769.200000 33.907694 -2.075543 1
234 51.630001 8.420836 1.104581 4 47 36442000 283766000 1.090000 260335779.800000 47.366973 -3.973395 9
235 24.660000 -24.123077 3.560178 73 6 -75150000 244977000 1.170000 209382051.300000 21.076923 -8.025110 3
236 75.660004 14.810321 1.066369 45 18 217100000 1093900000 4.430000 246930022.600000 17.079008 -10.464098 1
237 253.419998 0.964143 1.089370 47 4 -134259000 931216000 9.320000 99915879.830000 27.190987 -0.364026 1
238 67.610001 0.865287 1.589520 32 64 -588000000 -7829000000 -10.230000 765298142.700000 93.089287 3.345102 3
239 16.150000 3.128991 1.132813 5 99 -298400000 260100000 0.860000 302441860.500000 18.779070 -0.426859 4
240 20.650000 3.821021 1.259611 228 34 -403561000 407943000 2.040000 199972058.800000 10.122549 -0.733207 6
241 47.400002 -9.317004 1.439564 23 110 278800000 1604000000 4.520000 354867256.600000 10.486726 6.290521 6
242 53.189999 0.510206 1.039803 5 6 -28000000 888000000 1.810000 490607734.800000 29.386740 -1.121059 11
243 1274.949951 3.190527 1.268340 29 184 -1671386000 2551360000 50.090000 50935516.070000 25.453183 -1.052429 1
244 38.689999 -8.230553 1.180661 13 11 -8000000 1679000000 3.320000 505722891.600000 11.653614 -0.361858 11
245 99.919998 6.072188 0.805357 45 68 2962000000 5452000000 3.710000 1469541779.000000 26.932614 -5.190734 2
246 32.279999 3.130991 1.238748 11 79 298000000 6960000000 1.130000 6159292035.000000 28.566371 -4.213309 5
247 44.980000 -5.305263 1.528985 13 117 700900000 1234000000 4.110000 300243309.000000 10.944039 -4.526995 4
248 79.410004 10.660538 0.806056 17 129 160383000 636056000 3.280000 491391569.000000 24.070121 -2.256747 2
249 31.799999 3.515625 1.086898 17 99 116000000 1267600000 2.160000 586851851.900000 14.722222 -0.843313 4
250 17.820000 -5.564393 1.694751 10 25 -533785000 494090000 1.380000 358036231.900000 12.913043 -0.307832 1
251 87.910004 10.328820 0.861453 52 22 1735000000 6873000000 4.420000 1554977376.000000 19.889141 -1.418027 2
252 95.309998 6.993709 1.120534 9 99 -295000000 4106000000 7.520000 546010638.300000 12.674202 -0.694126 4
253 49.529999 -3.034462 1.875910 2 8 15900000 -76400000 -0.420000 181904761.900000 14.579221 -6.574869 6
254 64.480003 0.498752 1.143421 10 3 31884000 437257000 3.940000 110978934.000000 16.365483 -6.089228 11
255 98.820000 19.160738 1.533003 28 31 625000000 1406000000 5.180000 271428571.400000 19.077220 -0.593158 8
256 34.130001 3.424245 1.109059 7 22 -563000000 682000000 1.010000 675247524.800000 33.792080 -2.827111 11
257 81.410004 6.585503 1.227467 13 117 2694000000 5642000000 12.370000 456103476.200000 6.581245 -4.178927 4
258 81.800003 5.371643 1.379589 18 41 -2133000000 4227000000 7.780000 543316195.400000 10.514139 7.029056 3
259 20.250000 -16.632362 2.954291 10 11 -61744000 321824000 1.590000 202405031.400000 12.735849 4.291894 6
260 102.400002 0.293834 1.131240 35 8 21000000 1547000000 5.390000 287012987.000000 18.998145 0.574887 8
261 36.200001 17.456201 1.925754 9 25 -808000000 1228000000 1.000000 1228000000.000000 36.200001 5.434039 7
262 56.830002 -23.244191 1.945966 15 4 10853000 304768000 5.750000 53003130.430000 9.883479 -12.020894 6
263 101.209999 13.425973 1.556512 8 9 -67676000 665783000 3.030000 219730363.000000 33.402640 -15.727481 1
264 542.869995 16.995320 1.802345 17 129 160383000 636056000 6.170000 103088492.700000 87.985412 20.409000 5
265 47.139999 -7.659158 1.142370 36 34 -62542000 357796000 2.720000 131542647.100000 17.330882 4.089472 6
266 189.789993 20.432767 1.058807 13 108 168081000 696067000 6.920000 100587716.800000 27.426300 -16.215469 6
267 24.610001 -25.106512 3.712995 26 0 23000 -713685000 -4.290000 166360139.900000 93.089287 0.525090 3
268 43.990002 6.745943 0.839821 10 2 -42800000 749900000 2.140000 350420560.700000 20.556076 -2.428225 6
269 60.490002 7.232764 1.266240 14 9 39000000 746000000 5.220000 142911877.400000 11.588123 -4.016461 11
270 32.930000 15.462833 1.456940 11 99 615000000 1447000000 1.040000 1391346154.000000 31.663462 -0.130090 4
271 23.940001 -9.898378 2.030786 3 6 -2000000 196000000 0.290000 675862069.000000 82.551728 -2.580408 3
272 44.599998 -5.146750 1.580117 43 19 73000000 225400000 1.630000 138282208.600000 27.361962 -2.716908 8
273 259.600006 16.537983 1.426488 121 10 165012000 1053849000 11.380000 92605360.280000 22.811951 2.825366 8
274 112.980003 4.004424 1.091967 4 47 -26010000 284084000 1.020000 278513725.500000 110.764709 -3.089477 9
275 55.209999 12.238260 1.773865 40 23 -654720000 606828000 4.680000 129664102.600000 11.797008 -7.961579 1
276 46.790001 4.372070 0.895059 11 15 694000000 2421000000 2.600000 931153846.200000 17.996154 -2.795456 11
277 194.440002 5.284822 1.135546 48 47 88852000 2139375000 5.880000 363839285.700000 33.068028 -1.269332 9
278 98.580002 14.044424 1.080858 596 51 -1016000000 1156000000 4.260000 271361502.300000 23.140846 -4.178927 4
279 120.599998 -13.906342 1.203816 10 9 33398000 267046000 3.020000 88425827.810000 39.933774 -18.864194 6
280 94.010002 -2.791849 1.126448 11 9 -167000000 1350000000 5.430000 248618784.500000 17.313076 -8.547222 11
281 42.840000 11.970732 1.437938 8 99 -2630000000 1933000000 3.620000 533977900.600000 11.834254 -2.481376 4
282 66.360001 -0.866449 1.444644 9 117 -648000000 1980000000 4.530000 437086092.700000 14.649007 -4.044970 4
283 76.830002 -8.513933 2.017394 25 225 237800000 798300000 4.210000 189619952.500000 18.249407 7.413777 7
284 7.110000 -44.798137 4.580042 200 2 -38000000 -4556000000 -6.070000 402141680.400000 93.089287 1.273530 3
285 30.410000 -2.874478 1.835028 18 99 497000000 2214000000 2.660000 832330827.100000 11.432331 -0.842213 4
286 92.940002 -1.650792 1.138163 17 116 1584000000 1439000000 3.820000 376701570.700000 24.329843 7.026782 5
287 34.410000 5.942118 0.859442 11 11 -3482000000 13345000000 2.370000 5630801688.000000 14.518987 -23.537323 10
288 93.919998 13.129368 1.217803 5 35 -150300000 359500000 1.940000 185309278.400000 48.412370 -25.385129 2
289 26.420000 -8.833678 2.730659 25 88 5000000 -214000000 -1.530000 139869281.000000 74.555557 4.068084 7
290 25.520000 13.624226 1.797269 21 21 10716000 459522000 2.040000 225255882.400000 12.509804 -12.726553 1
291 57.160000 1.168142 1.022968 13 99 -4636000 527100000 4.210000 125201900.200000 13.577197 -1.883912 4
292 141.850006 15.607180 1.247751 9 11 -891400000 1975400000 4.960000 398266129.000000 28.598792 -28.032512 5
293 85.250000 34.803917 1.578344 14 212 159000000 198000000 1.380000 143478260.900000 61.775362 2.627576 1
294 112.860001 13.029548 0.959365 15 99 6000000 3439000000 10.990000 312920837.100000 10.269336 -0.891599 4
295 85.500000 1.303315 1.431109 29 10 12679000 410395000 3.030000 135443894.400000 28.217822 6.010954 1
296 53.330002 23.249369 1.586719 13 19 250000000 1220000000 3.280000 199237804.900000 24.070121 -1.980483 2
297 105.370003 8.584091 1.854132 30 37 -58000000 1540000000 12.500000 123200000.000000 8.429600 4.594156 3
298 49.799999 9.234479 1.579248 20 115 100145000 369041000 1.980000 186384343.400000 25.151515 -2.347391 7
299 54.810001 9.971912 1.263479 30 126 -199000000 2986000000 2.860000 1044055944.000000 19.164336 2.768051 7
300 80.610001 -16.948277 1.758824 14 27 -463323000 232573000 3.030000 213598256.500000 20.819876 -0.857290 1
301 57.299999 8.215294 1.747606 82 42 1004000000 7340000000 19.520000 376024590.200000 2.935451 -23.195292 6
302 37.570000 8.583821 1.157906 12 47 -8482000 340383000 1.300000 261833076.900000 28.900000 -3.089477 9
303 119.489998 -5.136552 2.048697 16 6 29159000 680528000 6.890000 98770391.870000 17.342525 6.255903 5
304 117.639999 1.466273 1.482349 17 70 3428000000 5813000000 6.100000 952950819.700000 19.285246 -8.805281 5
305 33.290001 3.804181 1.102848 10 117 10400000 867100000 3.510000 247037037.000000 9.484331 -4.178927 4
306 78.199997 -12.371135 1.430297 23 43 -195000000 4772000000 5.510000 866061706.000000 14.192377 1.066899 6
307 96.230003 -2.797977 0.826408 196 44 439000000 4844000000 5.380000 900371747.200000 17.886618 1.069558 6
308 96.070000 8.065239 0.949396 28 31 1630000000 7608000000 8.720000 872477064.200000 11.017202 -13.198055 6
309 80.800003 9.248248 1.034843 24 61 -3800000 411500000 4.130000 99636803.870000 19.564165 9.471400 5
310 70.709999 17.341522 1.626934 19 56 425000000 3990000000 8.000000 498750000.000000 8.838750 15.262155 3
311 94.970001 6.028805 1.845710 5 81 142787000 221177000 1.660000 133239156.600000 57.210844 -0.266191 8
312 99.959999 10.027519 1.019724 11 47 637230000 760434000 3.610000 210646537.400000 27.689750 -1.081912 9
313 76.879997 -1.448539 1.454019 37 10 98989000 507577000 3.070000 165334527.700000 25.042344 -13.617440 6
314 87.360001 23.459580 1.379480 35 127 37051000 375236000 3.290000 114053495.400000 26.553192 4.076543 7
315 125.830002 21.928300 2.456535 59 221 89509000 -556334000 -2.310000 240837229.400000 39.602928 2.559671 5
316 56.430000 0.213104 1.444924 4 47 -1803000 419222000 1.260000 332715873.000000 44.785714 -4.040751 9
317 46.220001 6.277303 0.842592 109 14 -6128000000 17879000000 4.380000 4081963470.000000 10.552512 -26.380687 10
318 134.580002 13.925341 1.044615 23 425 65488000 469053000 5.700000 82290000.000000 23.610527 29.547150 5
319 51.310001 -1.986623 1.103033 7 2 -12100000 640300000 2.360000 271313559.300000 21.741526 -1.850995 11
320 54.360001 5.532912 0.969774 12 99 -460000000 22894000000 4.180000 5477033493.000000 13.004785 -0.938007 4
321 146.869995 -0.230971 2.397803 17 10 -254000000 783000000 9.950000 78693467.340000 14.760804 -45.086335 1
322 53.369999 7.061186 0.940366 14 2 -1268000000 753000000 1.660000 453614457.800000 32.150602 -1.415299 6
323 25.700001 -30.988186 3.719560 9 4 -140000000 -571000000 -0.760000 751315789.500000 93.089287 -14.561121 3
324 17.910000 -2.610109 1.273051 60 16 -467300000 837800000 1.630000 513987730.100000 10.987730 -8.043772 7
325 29.980000 8.544529 1.338067 10 116 -568000000 506000000 0.890000 568539325.800000 33.685393 2.284802 9
326 72.650002 1.000976 1.331918 64 9 -12000000 612000000 5.180000 118146718.100000 14.025097 -10.249967 1
327 69.190002 29.496541 3.794783 174 198 -102075000 195290000 1.930000 101186528.500000 35.849742 12.695712 1
328 89.379997 -14.403372 2.397940 86 190 373520000 -2408948000 -25.920000 92937808.640000 93.089287 7.186128 3
329 35.910000 1.383405 1.015052 9 2 5332000 984485000 1.940000 507466494.800000 18.510309 -2.261927 11
330 39.180000 7.696534 0.991011 10 99 734422000 1201560000 4.220000 284729857.800000 9.284360 -7.762677 4
331 77.949997 3.656915 1.370062 9 7 -911000000 16150000000 3.850000 4194805195.000000 20.246752 -2.706443 3
332 60.849998 19.901474 1.007230 11 60 133000000 251200000 1.790000 140335195.500000 33.994412 0.855096 5
333 10.630000 9.474768 1.866680 5 26 -43000000 474000000 0.420000 1128571429.000000 25.309524 -0.295949 7
334 36.500000 11.009729 1.166311 16 83 17000000 340000000 1.880000 180851063.800000 19.414894 4.130471 6
335 33.259998 14.887727 1.845149 15 459 -1032187000 -4359082000 -4.640000 939457327.600000 28.976191 6.261775 7
336 52.516175 -8.698917 1.478877 142 27 159000000 1293000000 2.970000 435353535.400000 17.682214 -3.838260 1
337 102.589996 9.347683 1.404206 1 100 376000000 147000000 0.780000 188461538.500000 131.525636 -23.884449 5
338 27.299999 -1.158588 1.468176 4 99 -43623000 309471000 1.200000 257892500.000000 22.749999 -0.063096 4
339 47.919998 16.678836 1.610285 32 65 272000000 339000000 0.680000 498529411.800000 70.470585 1.723068 5

Correlation Heatmap¶

In [ ]:
# Plot correlation heatmap
plt.figure(figsize=(15, 7))
sns.heatmap(df_eda_corr.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Scale Data¶

Let's look at all the numerical variables with regard to sector.

To do this, we'll want to first normalize the scale of the various numerical series so we can get an accurate picture of how they vary by sector.

In [ ]:
df_eda.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 14 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   GICS Sector                   340 non-null    object 
 1   GICS Sub Industry             340 non-null    object 
 2   Current Price                 340 non-null    float64
 3   Price Change                  340 non-null    float64
 4   Volatility                    340 non-null    float64
 5   ROE                           340 non-null    int64  
 6   Cash Ratio                    340 non-null    int64  
 7   Net Cash Flow                 340 non-null    int64  
 8   Net Income                    340 non-null    int64  
 9   Earnings Per Share            340 non-null    float64
 10  Estimated Shares Outstanding  340 non-null    float64
 11  P/E Ratio                     340 non-null    float64
 12  P/B Ratio                     340 non-null    float64
 13  GICS Sector Int               340 non-null    int64  
dtypes: float64(7), int64(5), object(2)
memory usage: 37.3+ KB
In [ ]:
# Normalize the data
categorical_df_eda = df_eda[['GICS Sector', 'GICS Sector Int']] # we want to preserve the integer without scaling
numerical_df_eda = df_eda.drop(columns=['GICS Sector', 'GICS Sub Industry', 'GICS Sector Int'])

# Scale it using min-max method (vs. Z-score), since we want to preserve the relationships between data points
min_max_scaler = MinMaxScaler()
normalized_numerical_data = min_max_scaler.fit_transform(numerical_df_eda)

# Create dataframe for numerical correlation heatmap
normalized_df_eda = pd.DataFrame(normalized_numerical_data, columns=numerical_df_eda.columns)

# Combine back with sector for pairplot that recombines friendly Sector names
final_normalized_df_eda = pd.concat([categorical_df_eda, normalized_df_eda], axis=1)
In [ ]:
final_normalized_df_eda.head(5)
Out[ ]:
GICS Sector GICS Sector Int Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
0 Industrials 6 0.029793 0.559101 0.247990 0.146288 0.053236 0.331665 0.649114 0.652260 0.104452 0.001491 0.328169
1 Health Care 5 0.043087 0.542850 0.380757 0.140830 0.080376 0.352152 0.597707 0.578219 0.261814 0.030224 0.328335
2 Health Care 5 0.031808 0.571834 0.140499 0.021834 0.069937 0.379895 0.582677 0.576332 0.240842 0.023500 0.369059
3 Information Technology 7 0.070400 0.598024 0.162344 0.008734 0.187891 0.343024 0.503597 0.561236 0.076973 0.136392 0.391448
4 Information Technology 7 0.040002 0.443347 0.251634 0.014192 0.283925 0.360413 0.505001 0.552700 0.362110 0.334251 0.376145
In [ ]:
#Re-plot the correlation heatmap using scaled data

# Convert scaled Numpy array results to dataframe
normalized_numerical_data_df = pd.DataFrame(normalized_numerical_data, columns=numerical_df_eda.columns)

# Plot new results
plt.figure(figsize=(15, 7))
sns.heatmap(normalized_numerical_data_df.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()

OBSERVATIONS

Our revised correlation matrix using scaled data modified the correlations considerably:

Positive correlations:

  • Net Income and Estimated Shares Outstanding (.59)
  • Net Income and Earnings per Share (.56)
  • Earnings per Share and Current Price (.48)
  • P/E Ratio and Current Price (.26)
  • P/E Ratio and Volatility (.26)

Negative correlations:

  • Price Change and Volatility (-.41)
  • Net Income and Volatility (-.38)
  • Net Income and ROE (-.29)
  • Earnings per Share and Volatility (-.38)
  • Earnings per Share and P/E Ratio (-.26)

Pairplot¶

In [ ]:
# Create pairplot, noting the sector values
sns.pairplot(df_eda_corr, diag_kind='kde')
Out[ ]:
<seaborn.axisgrid.PairGrid at 0x795d44836860>

OBSERVATIONS

  • The pairplot above clearly shows us that sector is an important feature with data clustered into multiple peaks, even though this isn't apparent in our correlation heatmap.

  • Price Change and Net Income are fairly normally distributed

  • P/E Ratio appears to be bi-modal.

  • GICS Sector (converted to an integer) appears to be tri-modal.

  • As noted earlier, many of the remaining features exhibit skewness (primary right-skewed).

Sector vs. Numerical Series¶

Since we have a major categorical variable (Sector), let's next compare each numerical series to it.

In [ ]:
# Show sector vs. other numeric series using normalized dataframe
for feature in list_numcol:
  plt.figure(figsize=(12, 5))
  sns.boxplot(data=final_normalized_df_eda, x='GICS Sector', y=feature, hue='GICS Sector')
  plt.xticks(rotation=90)
  plt.show()

OBSERVATIONS

For Sector vs. other variables, we can note a few apparent impacts. Energy in particular seems to have the widest range of values for numerous series.

  • Current Price and ROE seem to vary widely by sector.

  • Other series vary moderately by sector. Energy and Telecommunications in particular exhibit wide ranges.

    • Price Change: Energy, Materials, Information Technology, Health Care, Industrials
    • Volatility: Energy, Materials, Telecommunications
    • Cash Ratio: Information Technology, Health Care
    • Net Cash Flow: Telecommunications
    • Net Income: Telecommunications, Consumer Staples, Energy
    • Earnings per Share: Energy
    • Estimated Shares Outstanding: Telecommunications, Consumer Staples, Information Technology, Health Care
    • P/E Ratio: Energy
    • P/B Ratio: Telecommunications Services

Questions to Answer¶

Q1: What does the distribution of stock prices look like?¶
In [ ]:
# Create pivot table for price by sector (counts and averages)
df_eda_sector_pivot = df_eda.pivot_table(
    index='GICS Sector',
    values='Current Price',
    aggfunc=['count', 'mean']  # Calculate both count and mean
)

# Rename the columns for clarity
df_eda_sector_pivot.columns = ['Count', 'Average Current Price']
df_eda_sector_pivot = df_eda_sector_pivot.reset_index()

# Calculate percentages relative to the total count, excluding the total row
total_counts = df_eda_sector_pivot['Count'].sum()
df_eda_sector_pivot['Percentage of Total'] = (df_eda_sector_pivot['Count'] / total_counts) * 100

# Sort the DataFrame by counts in descending order
df_eda_sector_pivot.sort_values(by='Average Current Price', ascending=False, inplace=True)

# Display the combined DataFrame
display(df_eda_sector_pivot)
GICS Sector Count Average Current Price Percentage of Total
4 Health Care 40 132.048250 11.764706
0 Consumer Discretionary 40 128.095404 11.764706
8 Real Estate 27 90.976925 7.941176
7 Materials 20 76.551501 5.882353
5 Industrials 53 74.411807 15.588235
1 Consumer Staples 19 71.972807 5.588235
6 Information Technology 33 63.548485 9.705882
3 Financials 49 58.659183 14.411765
10 Utilities 24 52.968542 7.058824
2 Energy 30 46.042334 8.823529
9 Telecommunications Services 5 32.964000 1.470588

Answer

  • As a whole, the range of Current Price is very wide, and it has a distinctly right-skewed distribution, with a maximum price of 1274.95 and minimum price of 4.50. Half of the stock prices fall below 80.86.

    • Mean: 80.86
    • Median: 59.71
    • Range: 1270.45
    • Standard Deviation: 98.06
  • Looking by sector, the highest average prices belong to the Health Care and Consumer Discretionary sectors, both of which comprise about 11.8% each of our dataset and have prices averaging around 130.

  • The lowest current prices belong to Telecommunications (1.5% of our dataset, at about 32.96) and Energy (8.8%, at 45.04).

  • Consumer discretionary has the highest outliers of any sector.

Q2: The stocks of which economic sector have seen the maximum price increase on average?¶

Earlier, we found that the average price change was just over $4.

To answer this question, let's further compare the price changes by sector.

In [ ]:
# Show prior dataframe
df_stats[df_stats['Feature'] == 'Price Change']
Out[ ]:
Feature Mean Median Min Max Range Standard Deviation Variance P-value
1 Price Change 4.078194 4.819505 -47.129693 55.051683 102.181377 12.006338 144.152149 0.000000
In [ ]:
# Create pivot table for price change by sector
df_eda_sector_price_change_pivot = df_eda.pivot_table(
    index='GICS Sector',
    values='Price Change',
    aggfunc=['mean', 'min', 'max']
)

df_eda_sector_price_change_pivot['Range'] = df_eda_sector_price_change_pivot['max'] - df_eda_sector_price_change_pivot['min']

df_eda_sector_price_change_pivot.columns = ['Mean Price Change', 'Min', 'Max', 'Range']
df_eda_sector_price_change_pivot.sort_values(by='Mean Price Change', ascending=False, inplace=True)
df_eda_sector_price_change_pivot = df_eda_sector_price_change_pivot.reset_index()

df_eda_sector_price_change_pivot
Out[ ]:
GICS Sector Mean Price Change Min Max Range
0 Health Care 9.585652 -12.532337 33.177346 45.709684
1 Consumer Staples 8.684750 -12.017268 24.496225 36.513493
2 Information Technology 7.217476 -23.790903 55.051683 78.842586
3 Telecommunications Services 6.956980 -2.301255 24.707504 27.008759
4 Real Estate 6.205548 -13.067267 15.573900 28.641168
5 Consumer Discretionary 5.846093 -33.131268 34.803917 67.935185
6 Materials 5.589738 -31.685167 37.489677 69.174843
7 Financials 3.865406 -14.292764 15.462833 29.755597
8 Industrials 2.833127 -23.244191 20.432767 43.676958
9 Utilities 0.803657 -8.230553 8.596874 16.827427
10 Energy -10.228289 -47.129693 17.341522 64.471216
In [ ]:
# Visualize price change
plt.figure(figsize=(10, 5))
sns.barplot(data=df_eda.sort_values(by='Price Change', ascending=False), x='GICS Sector', y='Price Change', hue='GICS Sector')
plt.xticks(rotation=90)
plt.show()
In [ ]:
# Visualize volatility
plt.figure(figsize=(10, 5))
sns.barplot(data=df_eda.sort_values(by='Volatility', ascending=False), x='GICS Sector', y='Volatility', hue='GICS Sector')
plt.xticks(rotation=90)
plt.show()

OBSERVATIONS

  • Health Care has the highest average price change out of all sectors. The top three sectors with the highest average price change are:

    • Health Care: \$9.59 (3.5%)
    • Consumer Staples: \$8.68 (2.8%)
    • Information Technology: \$7.22 (7.6%)
  • All sectors except the four listed below exceed the average price change of \$4.09:

    • Financials
    • Industrials
    • Utilities
    • Energy
  • Energy is the only sector with an average negative price change, and also exhibits the second-highest range of values. It also has the most volatility in price.

Q3: How are the different variables correlated with each other?¶

Next, let's drill down by sector into scaled correlation data to see if the relationships differ widely by sector.

In [ ]:
rank_mapping
Out[ ]:
{'Consumer Discretionary': 1,
 'Consumer Staples': 2,
 'Energy': 3,
 'Financials': 4,
 'Health Care': 5,
 'Industrials': 6,
 'Information Technology': 7,
 'Materials': 8,
 'Real Estate': 9,
 'Telecommunications Services': 10,
 'Utilities': 11}
In [ ]:
final_normalized_df_eda
Out[ ]:
GICS Sector GICS Sector Int Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
0 Industrials 6 0.029793 0.559101 0.247990 0.146288 0.053236 0.331665 0.649114 0.652260 0.104452 0.001491 0.328169
1 Health Care 5 0.043087 0.542850 0.380757 0.140830 0.080376 0.352152 0.597707 0.578219 0.261814 0.030224 0.328335
2 Health Care 5 0.031808 0.571834 0.140499 0.021834 0.069937 0.379895 0.582677 0.576332 0.240842 0.023500 0.369059
3 Information Technology 7 0.070400 0.598024 0.162344 0.008734 0.187891 0.343024 0.503597 0.561236 0.076973 0.136392 0.391448
4 Information Technology 7 0.040002 0.443347 0.251634 0.014192 0.283925 0.360413 0.505001 0.552700 0.362110 0.334251 0.376145
5 Consumer Staples 2 0.025330 0.343628 0.203627 0.009825 0.051148 0.344645 0.529018 0.576781 0.096340 0.017772 0.407517
6 Information Technology 7 0.214152 0.521807 0.099773 0.031659 0.026096 0.353399 0.502909 0.629976 0.006406 0.053523 1.000000
7 Utilities 11 0.030485 0.482516 0.101647 0.008734 0.014614 0.359533 0.503731 0.573277 0.035381 0.026074 0.367474
8 Utilities 11 0.042324 0.484447 0.087167 0.010917 0.009395 0.350991 0.533256 0.578039 0.064294 0.029558 0.356249
9 Financials 4 0.043607 0.490861 0.081919 0.014192 0.103340 0.340923 0.543277 0.602750 0.065743 0.013810 0.361799
10 Financials 4 0.045236 0.543032 0.097170 0.001092 0.103340 0.346522 0.536252 0.565100 0.207406 0.064241 0.349891
11 Real Estate 9 0.027966 0.535404 0.111823 0.015284 0.049061 0.351239 0.495658 0.563573 0.022172 0.044563 0.364794
12 Financials 4 0.059853 0.479808 0.098636 0.002183 0.103340 0.349607 0.493424 0.568605 0.006586 0.068150 0.351132
13 Financials 4 0.028683 0.455296 0.082895 0.009825 0.103340 0.355749 0.497911 0.568515 0.023598 0.032074 0.322947
14 Information Technology 7 0.037884 0.228406 0.169316 0.009825 0.234864 0.352146 0.497173 0.566089 0.024608 0.050092 0.391851
15 Materials 8 0.040545 0.720206 0.322641 0.009825 0.013570 0.279368 0.497455 0.576961 0.013633 0.029847 0.304456
16 Industrials 6 0.059829 0.481459 0.270419 0.037118 0.077244 0.349493 0.508151 0.609309 0.016410 0.017605 0.365548
17 Financials 4 0.045330 0.525751 0.083211 0.010917 0.103340 0.345490 0.535731 0.595921 0.064641 0.017504 0.349891
18 Industrials 6 0.048345 0.595832 0.143137 0.655022 0.046973 0.347717 0.493681 0.564291 0.011174 0.072870 0.366704
19 Health Care 5 0.146602 0.679851 0.335274 0.001092 0.203549 0.352621 0.493475 0.556025 0.030023 0.528618 0.301913
20 Information Technology 7 0.011154 0.723849 0.188950 0.018559 0.136743 0.406700 0.519179 0.560068 0.194225 0.025874 0.389782
21 Industrials 6 0.038640 0.482888 0.092569 0.018559 0.038622 0.350663 0.502790 0.572019 0.034659 0.035896 0.349096
22 Financials 4 0.122209 0.396534 0.353508 0.018559 0.069937 0.350970 0.501230 0.635187 0.004355 0.026469 0.219821
23 Health Care 5 0.124232 0.629206 0.233201 0.026201 0.377871 0.363474 0.635126 0.632132 0.119167 0.028195 0.488009
24 Financials 4 0.080224 0.437546 0.127141 0.022926 0.069937 0.341768 0.523035 0.627190 0.025108 0.017975 0.305681
25 Real Estate 9 0.072769 0.561382 0.112465 0.009825 0.040710 0.350782 0.504755 0.562674 0.074169 0.124431 0.269541
26 Consumer Discretionary 1 0.528466 0.777028 0.189042 0.003275 0.060543 0.392249 0.502898 0.561416 0.071425 1.000000 0.390009
27 Consumer Discretionary 1 0.043418 0.484237 0.194379 0.019651 0.001044 0.350516 0.499700 0.585228 0.013854 0.023320 0.332136
28 Health Care 5 0.106214 0.455167 0.202370 0.010917 0.073069 0.349362 0.543840 0.637344 0.038396 0.021701 0.219863
29 Financials 4 0.069039 0.499504 0.096668 0.024017 0.103340 0.350870 0.519345 0.594213 0.041304 0.030029 0.333161
30 Energy 3 0.031461 0.572780 0.434702 1.000000 0.083507 0.372388 0.000000 0.000000 0.058186 0.171688 0.395206
31 Energy 3 0.034696 0.257656 0.442437 0.055677 0.022965 0.149443 0.350969 0.431485 0.078294 0.171688 0.308300
32 Information Technology 7 0.037569 0.487597 0.071382 0.025109 0.182672 0.374587 0.506389 0.572109 0.045899 0.034679 0.410959
33 Industrials 6 0.002282 0.477362 0.483223 0.002183 0.038622 0.351870 0.483761 0.547129 0.164889 0.029998 0.383846
34 Information Technology 7 0.026927 0.689453 0.299768 0.010917 0.073069 0.255943 0.509068 0.560787 0.115715 0.055334 0.372395
35 Real Estate 9 0.141391 0.508775 0.103905 0.007642 0.049061 0.347149 0.505936 0.599694 0.017322 0.057705 0.355923
36 Information Technology 7 0.110709 0.636441 0.289590 0.030568 0.170146 0.357375 0.518908 0.575613 0.165761 0.029164 0.390255
37 Utilities 11 0.043489 0.545369 0.113953 0.008734 0.004175 0.351245 0.500396 0.573816 0.024671 0.037187 0.347122
38 Financials 4 0.051202 0.400400 0.043387 0.026201 0.103340 0.365382 0.598103 0.584958 0.078118 0.013955 0.368012
39 Industrials 6 0.110268 0.560129 0.109892 0.088428 0.025052 0.337076 0.598374 0.617486 0.107741 0.031026 0.478360
40 Financials 4 0.009705 0.543841 0.178203 0.005459 0.103340 1.000000 0.821680 0.587474 0.133309 0.019176 0.366409
41 Health Care 5 0.026487 0.624694 0.122531 0.010917 0.133612 0.328287 0.510652 0.565909 0.084178 0.035226 0.413074
42 Financials 4 0.026219 0.519368 0.089557 0.007642 0.103340 0.393907 0.533917 0.573187 0.126714 0.022211 0.366825
43 Health Care 5 0.145570 0.476397 0.171898 0.008734 0.085595 0.350256 0.493296 0.566089 0.007755 0.194836 0.350863
44 Energy 3 0.032784 0.340740 0.474772 0.012009 0.087683 0.368823 0.449468 0.509570 0.066934 0.171688 0.436729
45 Health Care 5 0.237593 0.509366 0.284082 0.040393 0.138831 0.355214 0.564415 0.688112 0.033099 0.032343 0.371773
46 Financials 4 0.028903 0.514298 0.121786 0.007642 0.103340 0.337014 0.556306 0.574445 0.184145 0.023164 0.354793
47 Materials 8 0.053705 0.623064 0.169883 0.022926 0.010438 0.351576 0.496329 0.568335 0.017834 0.061974 0.351994
48 Health Care 5 0.050604 0.618619 0.199047 0.010917 0.055324 0.250907 0.523098 0.558361 0.267013 0.133774 0.373846
49 Health Care 5 0.010972 0.576301 0.197199 0.003275 0.013570 0.342174 0.485491 0.548297 0.212033 0.054339 0.352066
50 Consumer Discretionary 1 0.030485 0.495200 0.344592 0.017467 0.026096 0.343673 0.503183 0.574355 0.032044 0.024677 0.376072
51 Real Estate 9 0.096848 0.531733 0.092622 0.009825 0.049061 0.318048 0.502629 0.583970 0.020579 0.058496 0.364794
52 Financials 4 0.037192 0.507376 0.137468 0.007642 0.103340 0.000000 0.849906 0.598526 0.515262 0.012626 0.365290
53 Industrials 6 0.049951 0.495980 0.197664 0.017467 0.026096 0.323001 0.542839 0.581723 0.111216 0.030970 0.401509
54 Financials 4 0.088433 0.590357 0.055027 0.009825 0.103340 0.385587 0.549552 0.628179 0.048552 0.019958 0.285856
55 Real Estate 9 0.023677 0.541463 0.146793 0.020742 0.012526 0.344286 0.501879 0.564651 0.049896 0.034564 0.354335
56 Real Estate 9 0.064505 0.554884 0.059016 0.021834 0.037578 0.350657 0.522180 0.589810 0.051356 0.031490 0.318994
57 Consumer Discretionary 1 0.039340 0.548710 0.159630 0.006550 0.020877 0.383836 0.527100 0.570222 0.122278 0.040317 0.334539
58 Health Care 5 0.090724 0.543920 0.329531 0.028384 0.347599 0.374287 0.523869 0.568065 0.124828 0.107315 0.349926
59 Materials 8 0.028580 0.370704 0.425026 0.016376 0.026096 0.297054 0.504334 0.576602 0.031998 0.020577 0.369062
60 Financials 4 0.017073 0.561779 0.118557 0.003275 0.103340 0.344583 0.507984 0.563842 0.083871 0.026588 0.370673
61 Consumer Staples 2 0.029863 0.471488 0.050915 0.020742 0.039666 0.347648 0.499029 0.578039 0.016871 0.020232 0.325030
62 Energy 3 0.000000 0.088352 0.994742 0.748908 0.022965 0.247873 0.184344 0.348369 0.102262 0.048509 0.362010
63 Industrials 6 0.045275 0.373077 0.117578 0.046943 0.012526 0.351786 0.501099 0.581544 0.019102 0.027964 0.376428
64 Consumer Discretionary 1 0.140580 0.496453 0.250795 0.641921 0.000000 0.350619 0.484824 0.528080 0.013675 0.034059 0.000000
65 Health Care 5 0.111638 0.546206 0.222319 0.017467 0.073069 0.367697 0.534125 0.623326 0.037287 0.028519 0.328066
66 Financials 4 0.043032 0.556922 0.052679 0.009825 0.103340 0.349087 0.503690 0.584689 0.022205 0.023527 0.349891
67 Consumer Staples 2 0.048896 0.508029 0.042192 0.504367 0.028184 0.346835 0.519325 0.563663 0.143013 0.077331 0.368308
68 Financials 4 0.029383 0.479831 0.214327 0.006550 0.103340 0.352934 0.501334 0.576242 0.024487 0.021598 0.370673
69 Financials 4 0.067771 0.437726 0.153419 0.005459 0.103340 0.360769 0.516469 0.583251 0.050304 0.040916 0.085141
70 Consumer Discretionary 1 0.374159 0.136996 0.452533 0.022926 0.247390 0.345194 0.500388 0.687393 0.000557 0.054137 0.454814
71 Industrials 6 0.065733 0.276419 0.192156 0.019651 0.050104 0.332103 0.519637 0.620541 0.024515 0.015734 0.475469
72 Utilities 11 0.024857 0.480317 0.079202 0.014192 0.012526 0.351777 0.501668 0.566987 0.041581 0.030573 0.371291
73 Health Care 5 0.048258 0.673726 0.406962 0.016376 0.073069 0.355248 0.497874 0.576781 0.014850 0.036325 0.364618
74 Utilities 11 0.010910 0.475294 0.170711 0.020742 0.044885 0.349493 0.476048 0.535448 0.065585 0.027381 0.373497
75 Financials 4 0.053272 0.455172 0.164141 0.008734 0.103340 0.374984 0.574901 0.614161 0.087866 0.013635 0.367444
76 Energy 3 0.010382 0.264533 0.603776 0.005459 0.000000 0.349917 0.488099 0.547399 0.061824 0.171688 0.369893
77 Health Care 5 0.102090 0.366578 0.213652 0.007642 0.003132 0.350282 0.494716 0.565999 0.018374 0.059148 0.375148
78 Industrials 6 0.016884 0.418670 0.232151 0.017467 0.077244 0.349274 0.531499 0.567886 0.155967 0.019119 0.375378
79 Telecommunications Services 10 0.016262 0.462794 0.205109 0.005459 0.003132 0.350494 0.508776 0.564112 0.086115 0.024735 0.305755
80 Information Technology 7 0.043701 0.415684 0.157260 0.017467 0.189979 0.354157 0.524319 0.573906 0.094660 0.037219 0.405689
81 Information Technology 7 0.056004 0.549525 0.321222 0.016376 0.054280 0.353946 0.497131 0.567976 0.021400 0.066085 0.362378
82 Consumer Staples 2 0.073415 0.474218 0.196056 0.014192 0.011482 0.349869 0.599646 0.591787 0.178770 0.034365 0.336768
83 Energy 3 0.067268 0.586943 0.264498 0.002183 0.044885 0.295415 0.586095 0.572019 0.299588 0.064051 0.394198
84 Energy 3 0.069550 0.434381 0.509344 0.000000 0.039666 0.357705 0.491847 0.554767 0.015390 0.321894 0.398633
85 Utilities 11 0.049699 0.422201 0.040752 0.015284 0.008351 0.359596 0.530060 0.578758 0.091969 0.034538 0.333916
86 Industrials 6 0.036357 0.592131 0.184840 0.044760 0.032359 0.346929 0.584824 0.600952 0.125441 0.011405 0.289443
87 Materials 8 0.048880 0.828129 0.219585 0.020742 0.062630 0.300200 0.531186 0.569413 0.142267 0.052858 0.386831
88 Industrials 6 0.056492 0.499917 0.212843 0.030568 0.022965 0.362292 0.530915 0.586126 0.082289 0.022675 0.401574
89 Financials 4 0.038663 0.496992 0.110930 0.020742 0.103340 0.422119 0.538357 0.596100 0.068369 0.014276 0.369148
90 Health Care 5 0.052454 0.614637 0.168533 0.015284 0.011482 0.348711 0.505253 0.594123 0.018989 0.021946 0.348794
91 Industrials 6 0.051885 0.548576 0.119136 0.014192 0.013570 0.281284 0.560463 0.593135 0.109324 0.022289 0.303922
92 Consumer Discretionary 1 0.079169 0.481290 0.118353 0.019651 0.027140 0.377080 0.665207 0.594393 0.271651 0.034837 0.351558
93 Consumer Discretionary 1 0.017458 0.481071 0.248532 0.019651 0.026096 0.351276 0.512028 0.528080 0.013675 0.034059 0.000000
94 Consumer Discretionary 1 0.016309 0.496202 0.280482 0.019651 0.026096 0.351276 0.512028 0.528080 0.013675 0.034059 0.000000
95 Consumer Discretionary 1 0.063938 0.579744 0.183973 0.068777 0.014614 0.340392 0.520700 0.595561 0.042038 0.026548 0.367753
96 Real Estate 9 0.055980 0.613650 0.087667 0.006550 0.040710 0.351252 0.496658 0.563932 0.026504 0.086724 0.334480
97 Industrials 6 0.078264 0.449605 0.157208 0.015284 0.039666 0.352005 0.493992 0.591967 0.001369 0.036701 0.313093
98 Industrials 6 0.044716 0.529541 0.201307 0.025109 0.027140 0.340567 0.508606 0.599515 0.021186 0.015562 0.359803
99 Consumer Staples 2 0.069818 0.637876 0.108564 0.037118 0.060543 0.371919 0.506400 0.585857 0.026637 0.038782 0.309000
100 Utilities 11 0.052651 0.453079 0.094509 0.006550 0.008351 0.313681 0.549177 0.586306 0.108884 0.027979 0.349405
101 Health Care 5 0.051328 0.425786 0.124381 0.005459 0.082463 0.367255 0.496096 0.561326 0.030125 0.098941 0.380545
102 Energy 3 0.021646 0.309759 0.569432 0.222707 0.073069 0.376517 0.189160 0.230479 0.061796 0.171688 0.379683
103 Information Technology 7 0.018088 0.580272 0.175763 0.027293 0.282881 0.209934 0.526433 0.562764 0.192220 0.031006 0.393407
104 Materials 8 0.086489 0.498265 0.089775 0.015284 0.002088 0.346904 0.511363 0.580286 0.043839 0.058855 0.298222
105 Utilities 11 0.047046 0.422341 0.087042 0.008734 0.020877 0.358345 0.515343 0.586486 0.043292 0.024482 0.356249
106 Industrials 6 0.084120 0.603444 0.090431 0.019651 0.015658 0.349462 0.499418 0.582352 0.014872 0.053161 0.331421
107 Utilities 11 0.043063 0.401195 0.050456 0.009825 0.003132 0.351464 0.513759 0.578039 0.053688 0.030435 0.339938
108 Materials 8 0.049597 0.496998 0.174517 0.022926 0.014614 0.353028 0.508151 0.601222 0.019708 0.016926 0.310991
109 Energy 3 0.052178 0.421320 0.314005 0.037118 0.040710 0.307747 0.396154 0.475425 0.084498 0.171688 0.377883
110 Real Estate 9 0.234484 0.559293 0.149451 0.006550 0.171190 0.401161 0.494388 0.579118 0.004910 0.171606 0.487250
111 Real Estate 9 0.060679 0.539896 0.083970 0.007642 0.049061 0.350625 0.508612 0.571210 0.055363 0.059970 0.364794
112 Energy 3 0.037491 0.253235 0.424167 0.001092 0.209812 0.366940 0.492249 0.554947 0.020291 0.171688 0.417611
113 Utilities 11 0.036656 0.468183 0.129889 0.007642 0.001044 0.350095 0.508786 0.574805 0.047209 0.029521 0.365279
114 Real Estate 9 0.184903 0.527442 0.100149 0.003275 0.049061 0.350684 0.495312 0.581364 0.006303 0.124675 0.355923
115 Financials 4 0.019788 0.585102 0.186875 0.004367 0.103340 0.364632 0.496060 0.558181 0.042996 0.055764 0.368069
116 Industrials 6 0.037420 0.472651 0.204911 0.013100 0.010438 0.334511 0.531728 0.588103 0.071429 0.017728 0.328874
117 Utilities 11 0.050266 0.509297 0.125878 0.001092 0.045929 0.348334 0.487206 0.541019 0.021307 0.029558 0.401070
118 Health Care 5 0.058625 0.574923 0.242617 0.020742 0.268267 0.352577 0.500790 0.570581 0.030579 0.059805 0.401927
119 Utilities 11 0.018316 0.398565 0.160762 0.008734 0.077244 0.495183 0.537774 0.574041 0.044395 0.027381 0.362618
120 Industrials 6 0.031957 0.417694 0.085625 0.028384 0.098121 0.346825 0.500005 0.571660 0.026300 0.029901 0.400181
121 Consumer Discretionary 1 0.094297 0.509136 0.219810 0.016376 0.030271 0.359114 0.506410 0.602660 0.016726 0.034736 0.156034
122 Real Estate 9 0.065890 0.597486 0.117731 0.019651 0.040710 0.351437 0.498706 0.564112 0.036254 0.100730 0.302011
123 Consumer Discretionary 1 0.007549 0.484706 0.108735 0.027293 0.044885 0.460497 0.644173 0.566628 0.641968 0.008836 0.395879
124 Industrials 6 0.028588 0.568733 0.176307 0.030568 0.036534 0.351011 0.501237 0.565819 0.043065 0.038329 0.392558
125 Information Technology 7 0.078838 0.620015 0.152706 0.007642 1.000000 0.369073 0.566959 0.561686 0.452261 0.146557 0.399659
126 Industrials 6 0.040143 0.625821 0.159905 0.013100 0.032359 0.352014 0.497040 0.567616 0.021565 0.048061 0.360742
127 Materials 8 0.001787 0.151148 0.796294 0.168122 0.005219 0.343050 0.237065 0.448288 0.170775 0.037853 0.385287
128 Utilities 11 0.021433 0.472782 0.131437 0.004367 0.002088 0.351995 0.502522 0.562225 0.064294 0.038516 0.341384
129 Information Technology 7 0.044158 0.357954 0.107914 0.006550 0.030271 0.356650 0.504040 0.569863 0.043297 0.046394 0.277958
130 Information Technology 7 0.068448 0.512471 0.044536 0.028384 0.018789 0.349962 0.505316 0.577231 0.033684 0.051704 0.332109
131 Information Technology 7 0.018552 0.463332 0.267237 0.015284 0.084551 0.348724 0.495511 0.565460 0.018271 0.025309 0.390547
132 Industrials 6 0.033626 0.567117 0.270685 0.014192 0.076200 0.349204 0.499073 0.575523 0.019093 0.025962 0.444049
133 Industrials 6 0.029580 0.482867 0.272558 0.016376 0.028184 0.347932 0.496053 0.567976 0.017205 0.034279 0.403862
134 Materials 8 0.027258 0.608897 0.374999 0.027293 0.005219 0.349590 0.500667 0.582802 0.017277 0.014770 0.395844
135 Real Estate 9 0.111457 0.527843 0.131716 0.013100 0.049061 0.349715 0.494856 0.577231 0.006765 0.085933 0.351615
136 Information Technology 7 0.048400 1.000000 0.348868 0.009825 0.198330 0.339446 0.501864 0.598616 0.011929 0.017596 0.480949
137 Telecommunications Services 10 0.000134 0.438714 0.336287 0.002183 0.517745 0.358501 0.486387 0.547309 0.105713 0.022060 0.422143
138 Industrials 6 0.104577 0.456697 0.053649 0.029476 0.022965 0.300419 0.552283 0.632851 0.047877 0.022751 0.391659
139 Real Estate 9 0.017876 0.502466 0.170834 0.017467 0.049061 0.350070 0.519128 0.577231 0.021304 0.084615 0.351615
140 Health Care 5 0.076107 0.487554 0.197796 0.105895 0.154489 0.438884 0.867959 0.661066 0.234227 0.009988 0.386378
141 Information Technology 7 0.010847 0.525718 0.219742 0.006550 0.170146 0.325253 0.518386 0.559080 0.209581 0.028539 0.388611
142 Consumer Discretionary 1 0.023228 0.581427 0.158921 0.025109 0.034447 0.229920 0.692412 0.604816 0.254054 0.005010 0.347138
143 Consumer Discretionary 1 0.064064 0.500709 0.115383 0.022926 0.005219 0.352868 0.505184 0.591697 0.020237 0.029586 0.406096
144 Consumer Discretionary 1 0.025715 0.494447 0.242355 0.014192 0.126305 0.339197 0.499984 0.571390 0.026619 0.024027 0.399053
145 Consumer Discretionary 1 0.022173 0.563468 0.205261 0.007642 0.031315 0.329132 0.496873 0.560158 0.039407 0.048985 0.389675
146 Industrials 6 0.155921 0.409013 0.159983 0.036026 0.016701 0.352543 0.506504 0.654956 0.006215 0.027413 0.430015
147 Energy 3 0.023252 0.411307 0.320493 0.003275 0.197286 0.594082 0.476485 0.542816 0.134009 0.171688 0.455518
148 Consumer Discretionary 1 0.049479 0.391978 0.221008 0.028384 0.096033 0.353171 0.499892 0.582352 0.015900 0.029944 0.430173
149 Financials 4 0.005164 0.501782 0.157174 0.010917 0.103340 0.338877 0.504919 0.557283 0.133309 0.020096 0.370663
150 Health Care 5 0.049691 0.338588 0.307196 0.029476 0.013570 0.356030 0.534855 0.596100 0.063039 0.019467 0.335504
151 Real Estate 9 0.050006 0.461667 0.158198 0.005459 0.049061 0.347028 0.508173 0.571031 0.054412 0.049540 0.352762
152 Real Estate 9 0.023871 0.482949 0.142745 0.005459 0.049061 0.355645 0.478815 0.539042 0.070863 0.058496 0.364794
153 Energy 3 0.034618 0.416357 0.432927 0.016376 0.107516 0.359064 0.426767 0.453051 0.041721 0.048509 0.401513
154 Financials 4 0.030666 0.412250 0.107664 0.009825 0.103340 0.352089 0.525537 0.586306 0.063219 0.014845 0.349891
155 Consumer Discretionary 1 0.032185 0.292447 0.215034 0.043668 0.032359 0.344787 0.506154 0.583251 0.028553 0.017709 0.400560
156 Industrials 6 0.077980 0.552448 0.096256 0.027293 0.042797 0.303516 0.589869 0.604816 0.122755 0.026691 0.352900
157 Information Technology 7 0.008422 0.286665 0.693375 0.006550 0.046973 0.585856 0.541776 0.566089 0.079313 0.042609 0.390255
158 Information Technology 7 0.005777 0.482392 0.426371 0.016376 0.018789 0.422495 0.585408 0.566089 0.079313 0.042609 0.390255
159 Consumer Staples 2 0.027581 0.700968 0.089759 0.017467 0.030271 0.350965 0.504776 0.579387 0.027980 0.040249 0.361328
160 Health Care 5 0.120973 0.640497 0.072984 0.017467 0.003132 0.350013 0.500460 0.601851 0.009004 0.046530 0.399860
161 Real Estate 9 0.008532 0.429746 0.223939 0.007642 0.049061 0.336638 0.502105 0.551891 0.409140 0.127197 0.370427
162 Consumer Staples 2 0.066724 0.429314 0.118335 0.054585 0.016701 0.349671 0.501166 0.579387 0.027980 0.040249 0.361328
163 Health Care 5 0.136967 0.459812 0.229288 0.012009 0.073069 0.370449 0.517073 0.626651 0.019855 0.034217 0.328066
164 Information Technology 7 0.104782 0.409444 0.090909 0.099345 0.025052 0.325848 0.765437 0.671040 0.155067 0.013852 0.394629
165 Health Care 5 0.053855 0.445911 0.191434 0.247817 0.041754 0.344503 0.494477 0.568515 0.010620 0.061496 0.366199
166 Materials 8 0.090629 0.607653 0.109099 0.027293 0.026096 0.341280 0.499213 0.596550 0.008661 0.038310 0.395581
167 Information Technology 7 0.023574 0.598590 0.128119 0.019651 0.169102 0.749249 0.728539 0.571570 0.768299 0.021632 0.390255
168 Materials 8 0.026132 0.460976 0.147774 0.025109 0.028184 0.324565 0.510027 0.570132 0.063477 0.026319 0.400826
169 Consumer Discretionary 1 0.014782 0.674788 0.105706 0.024017 0.020877 0.345624 0.499950 0.559889 0.062280 0.034350 0.372275
170 Real Estate 9 0.017718 0.333353 0.147709 0.025109 0.015658 0.350633 0.493042 0.555126 0.030141 0.083095 0.357511
171 Health Care 5 0.426353 0.644567 0.102121 0.014192 0.330898 0.354132 0.502748 0.692515 0.001538 0.059948 0.578636
172 Industrials 6 0.069408 0.586276 0.106503 0.038210 0.135699 0.322407 0.530060 0.596280 0.055508 0.028615 0.407954
173 Financials 4 0.022811 0.530402 0.220354 0.012009 0.069937 0.363443 0.510655 0.570222 0.065348 0.022622 0.391540
174 Industrials 6 0.054201 0.490217 0.126131 0.034934 0.001044 0.350544 0.499380 0.583071 0.014370 0.032270 0.384743
175 Industrials 6 0.029478 0.574167 0.259906 0.006550 0.024008 0.342056 0.496789 0.571660 0.015905 0.027422 0.401660
176 Information Technology 7 0.018183 0.533179 0.288183 0.014192 0.112735 0.343716 0.503684 0.564471 0.059283 0.026855 0.384394
177 Financials 4 0.048432 0.539854 0.103246 0.009825 0.103340 0.120950 1.000000 0.604277 0.654367 0.015194 0.361784
178 Real Estate 9 0.017285 0.546476 0.127772 0.018559 0.049061 0.350626 0.509112 0.567976 0.068034 0.019479 0.364794
179 Consumer Staples 2 0.096659 0.632610 0.035676 0.634279 0.010438 0.345240 0.511591 0.574894 0.054915 0.081614 0.358635
180 Energy 3 0.008202 0.000000 0.625491 0.000000 0.007307 0.347867 0.495747 0.550813 0.408102 0.278544 0.361749
181 Consumer Staples 2 0.030273 0.527907 0.040747 0.030568 0.077244 0.298980 0.643715 0.565100 0.704876 0.042819 0.363834
182 Industrials 6 0.055232 0.280792 0.347938 0.012009 0.018789 0.343945 0.500552 0.589541 0.013368 0.026655 0.361174
183 Industrials 6 0.029533 0.480471 0.122404 0.031659 0.037578 0.348067 0.497336 0.570671 0.018729 0.029051 0.385265
184 Consumer Discretionary 1 0.034956 0.477923 0.217320 0.014192 0.026096 0.346698 0.507211 0.584689 0.029322 0.018478 0.369480
185 Health Care 5 0.093778 0.599959 0.226149 0.008734 0.043841 0.354823 0.499581 0.595112 0.024557 0.024691 0.364670
186 Consumer Discretionary 1 0.019780 0.504701 0.180425 0.014192 0.012526 0.349706 0.499296 0.577141 0.030323 0.034059 0.366802
187 Industrials 6 0.090527 0.603522 0.202832 0.005459 0.007307 0.343207 0.485470 0.523228 0.008666 0.027422 0.440580
188 Health Care 5 0.062781 0.468962 0.183905 0.017467 0.056367 0.344139 0.540680 0.570312 0.168519 0.065099 0.367807
189 Industrials 6 0.167382 0.512656 0.044175 0.125546 0.008351 0.339422 0.565624 0.654327 0.046084 0.029998 0.318087
190 Utilities 11 0.021036 0.526245 0.099478 0.009825 0.000000 0.348958 0.498570 0.580106 0.014339 0.012108 0.348529
191 Financials 4 0.010146 0.321359 0.213439 0.001092 0.084551 0.330598 0.495729 0.556564 0.051050 0.039163 0.467583
192 Industrials 6 0.030351 0.596829 0.208774 0.031659 0.042797 0.359971 0.535939 0.579567 0.103274 0.019259 0.346041
193 Telecommunications Services 10 0.039246 0.703036 0.188166 0.036026 0.063674 0.359127 0.562039 0.637164 0.053148 0.005071 0.360035
194 Materials 8 0.064859 0.486418 0.227868 0.073144 0.048017 0.347210 0.583782 0.636355 0.071369 0.011613 0.421259
195 Information Technology 7 0.073092 0.534605 0.094288 0.067686 0.135699 0.369636 0.569856 0.580106 0.180321 0.049592 0.387803
196 Real Estate 9 0.067937 0.565193 0.115578 0.012009 0.049061 0.350898 0.497785 0.589541 0.008458 0.033625 0.355923
197 Real Estate 9 0.059971 0.502176 0.113382 0.009825 0.049061 0.350607 0.500637 0.577590 0.021304 0.044301 0.351615
198 Consumer Discretionary 1 0.049227 0.441917 0.236370 0.025109 0.003132 0.350307 0.508380 0.578848 0.038994 0.034059 0.310692
199 Industrials 6 0.018734 0.575122 0.180717 0.286026 0.063674 0.353215 0.497874 0.559170 0.051697 0.046734 0.381798
200 Consumer Discretionary 1 0.017844 0.760948 0.308963 0.014192 0.056367 0.348091 0.498174 0.559619 0.051272 0.042319 0.393084
201 Consumer Discretionary 1 0.089449 0.656370 0.000000 0.068777 0.271399 0.525948 0.584893 0.593225 0.148740 0.041087 0.405691
202 Financials 4 0.075438 0.484195 0.139239 0.181223 0.191023 0.367381 0.510096 0.592147 0.028150 0.035066 0.412970
203 Consumer Staples 2 0.031753 0.520737 0.152951 0.027293 0.017745 0.358032 0.641964 0.590260 0.259444 0.013428 0.308551
204 Financials 4 0.034405 0.474611 0.105407 0.007642 0.103340 0.411360 0.601167 0.591338 0.183340 0.014325 0.361799
205 Consumer Discretionary 1 0.145531 0.495630 0.197385 0.013100 0.003132 0.350051 0.503300 0.573187 0.034232 0.133665 0.351581
206 Consumer Staples 2 0.058601 0.579469 0.256114 0.111354 0.141962 0.363183 0.504096 0.579387 0.027980 0.040249 0.402638
207 Consumer Staples 2 0.063804 0.529514 0.077740 0.025109 0.009395 0.351661 0.498845 0.578129 0.016346 0.046301 0.361328
208 Materials 8 0.103963 0.354895 0.371986 0.006550 0.048017 0.352426 0.496493 0.588642 0.006415 0.054758 0.385849
209 Financials 4 0.040104 0.520179 0.078245 0.026201 0.103340 0.332291 0.523807 0.576961 0.082125 0.029492 0.361476
210 Industrials 6 0.115030 0.519249 0.064867 0.044760 0.028184 0.347460 0.591224 0.619283 0.097587 0.031570 0.380844
211 Consumer Staples 2 0.035541 0.566934 0.221681 0.010917 0.592902 0.407015 0.501871 0.583251 0.235153 0.042819 0.345682
212 Consumer Staples 2 0.042276 0.528624 0.058709 0.197598 0.034447 0.320781 0.599729 0.573906 0.315618 0.035928 0.338653
213 Materials 8 0.018175 0.351342 0.545250 0.009825 0.064718 0.316205 0.511328 0.574984 0.053965 0.013242 0.399475
214 Energy 3 0.037262 0.573857 0.326553 0.022926 0.018789 0.339078 0.549927 0.597448 0.083413 0.013072 0.399239
215 Health Care 5 0.038034 0.530049 0.141750 0.009825 0.073069 0.384430 0.583073 0.564112 0.453994 0.058074 0.350107
216 Energy 3 0.006368 0.262902 0.673851 0.012009 0.074113 0.313743 0.444528 0.520622 0.105747 0.171688 0.377187
217 Financials 4 0.091841 0.457695 0.168247 0.006550 0.103340 0.350390 0.512980 0.614790 0.019875 0.026373 0.366409
218 Health Care 5 0.263395 0.646621 0.099357 0.065502 0.018789 0.350983 0.497828 0.664480 0.000000 0.045063 0.392444
219 Energy 3 0.014129 0.377158 0.550581 0.045852 0.028184 0.322090 0.443135 0.432833 0.023910 0.048509 0.364654
220 Health Care 5 0.039018 0.785926 0.407120 0.008734 0.031315 0.382163 0.508143 0.564112 0.261814 0.058074 0.350446
221 Financials 4 0.005471 0.479520 0.389319 0.026201 0.103340 0.355280 0.511257 0.573816 0.056615 0.002607 0.361813
222 Energy 3 0.022378 0.532665 0.461744 0.025109 0.059499 0.345709 0.439587 0.495372 0.061072 0.171688 0.376688
223 Financials 4 0.042245 0.547458 0.215784 0.007642 0.122129 0.346616 0.499395 0.572918 0.022753 0.037682 0.313873
224 Utilities 11 0.078232 0.522283 0.075441 0.012009 0.006263 0.350369 0.547842 0.604816 0.068944 0.026791 0.335142
225 Materials 8 0.010618 0.567362 0.468662 0.001092 0.206681 0.362411 0.495059 0.553778 0.078928 0.074084 0.404959
226 Information Technology 7 0.086489 0.570313 0.486833 0.005459 0.067850 0.372317 0.493030 0.552520 0.064457 0.745526 0.343199
227 Energy 3 0.022087 0.428972 0.438892 0.265284 0.001044 0.350275 0.420388 0.359601 0.021375 0.171688 0.370305
228 Industrials 6 0.033138 0.509496 0.120963 0.013100 0.021921 0.353184 0.502356 0.563842 0.055462 0.051664 0.310666
229 Energy 3 0.022819 0.338329 0.316843 0.004367 0.051148 0.305017 0.474442 0.532033 0.058510 0.171688 0.417682
230 Industrials 6 0.063041 0.554501 0.373199 0.013100 0.051148 0.354560 0.522910 0.596010 0.044954 0.025812 0.375495
231 Financials 4 0.053202 0.517967 0.142558 0.010917 0.069937 0.456712 0.510773 0.586126 0.034895 0.028476 0.305681
232 Materials 8 0.028179 0.525685 0.189103 0.004367 0.153445 0.379186 0.497929 0.559889 0.048037 0.063551 0.425410
233 Consumer Discretionary 1 0.031154 0.558905 0.236071 0.019651 0.014614 0.352915 0.497769 0.561596 0.039396 0.058983 0.360865
234 Real Estate 9 0.037097 0.543646 0.096551 0.003275 0.049061 0.351697 0.496389 0.559709 0.037945 0.084615 0.351615
235 Energy 3 0.015868 0.225155 0.734885 0.078603 0.006263 0.348206 0.495580 0.560428 0.029635 0.034548 0.331868
236 Consumer Discretionary 1 0.056012 0.606177 0.086617 0.048035 0.018789 0.357347 0.513277 0.589721 0.035759 0.026935 0.319982
237 Consumer Discretionary 1 0.195931 0.470671 0.092596 0.050218 0.004175 0.346357 0.509886 0.633660 0.011782 0.046192 0.369206
238 Energy 3 0.049675 0.469704 0.222611 0.033843 0.066806 0.332166 0.327267 0.457993 0.120299 0.171688 0.387283
239 Financials 4 0.009170 0.491858 0.103889 0.004367 0.103340 0.341224 0.495895 0.557642 0.044812 0.030172 0.368900
240 Industrials 6 0.012712 0.498630 0.136851 0.247817 0.035491 0.337934 0.498977 0.568245 0.028100 0.013687 0.367407
241 Industrials 6 0.033768 0.370055 0.183630 0.024017 0.114823 0.359277 0.523911 0.590529 0.053362 0.014381 0.401638
242 Utilities 11 0.038325 0.466229 0.079711 0.004367 0.006263 0.349681 0.508985 0.566178 0.075500 0.050373 0.365517
243 Consumer Discretionary 1 1.000000 0.492460 0.139120 0.030568 0.192067 0.298280 0.543660 1.000000 0.003794 0.042882 0.365851
244 Utilities 11 0.026912 0.380687 0.116328 0.013100 0.011482 0.350307 0.525474 0.579747 0.077965 0.016603 0.369217
245 Consumer Staples 2 0.075107 0.520661 0.018767 0.048035 0.070981 0.443200 0.604128 0.583251 0.235153 0.045700 0.345682
246 Health Care 5 0.021866 0.491877 0.131427 0.010917 0.082463 0.359877 0.635564 0.560068 1.000000 0.048811 0.350446
247 Financials 4 0.031863 0.409316 0.206875 0.013100 0.122129 0.372479 0.516198 0.586845 0.044453 0.015251 0.348917
248 Consumer Staples 2 0.058963 0.565565 0.018949 0.017467 0.134656 0.355573 0.503733 0.579387 0.075628 0.040249 0.359982
249 Financials 4 0.021488 0.495641 0.091954 0.017467 0.103340 0.354185 0.516898 0.569323 0.091196 0.022447 0.366870
250 Consumer Discretionary 1 0.010484 0.406780 0.249966 0.009825 0.026096 0.333861 0.500773 0.562315 0.053879 0.019001 0.369480
251 Consumer Staples 2 0.065654 0.562319 0.033349 0.055677 0.022965 0.404823 0.633750 0.589631 0.249087 0.032286 0.364069
252 Financials 4 0.071479 0.529680 0.100697 0.008734 0.103340 0.341330 0.576068 0.617486 0.084535 0.018546 0.367597
253 Industrials 6 0.035444 0.431539 0.297058 0.001092 0.008351 0.351054 0.488881 0.546141 0.025154 0.022174 0.338936
254 Utilities 11 0.047212 0.466117 0.106647 0.009825 0.003132 0.351554 0.499588 0.585318 0.013586 0.025576 0.341303
255 Materials 8 0.074241 0.648753 0.207919 0.029476 0.032359 0.370105 0.519783 0.596460 0.039754 0.030740 0.368089
256 Utilities 11 0.023322 0.494747 0.097714 0.006550 0.022965 0.332948 0.504690 0.558990 0.105612 0.058763 0.357202
257 Financials 4 0.060538 0.525685 0.128495 0.013100 0.122129 0.434818 0.608088 0.661066 0.069872 0.006943 0.350613
258 Energy 3 0.060845 0.513805 0.168039 0.018559 0.042797 0.283842 0.578591 0.619822 0.084096 0.014433 0.405238
259 Industrials 6 0.012397 0.298463 0.577385 0.009825 0.011482 0.348626 0.497182 0.564202 0.028497 0.018664 0.391898
260 Materials 8 0.077059 0.464111 0.103480 0.037118 0.008351 0.351214 0.522723 0.598347 0.042296 0.030590 0.373782
261 Information Technology 7 0.024952 0.632071 0.310015 0.008734 0.026096 0.325285 0.516073 0.558900 0.195760 0.063349 0.397464
262 Industrials 6 0.041190 0.233756 0.315269 0.015284 0.004175 0.350896 0.496827 0.601581 0.004131 0.013232 0.312394
263 Consumer Discretionary 1 0.076123 0.592629 0.214030 0.007642 0.009395 0.348440 0.504352 0.577141 0.031323 0.058021 0.294329
264 Health Care 5 0.423763 0.627561 0.277935 0.017467 0.134656 0.355573 0.503733 0.605355 0.012300 0.161968 0.470447
265 Industrials 6 0.033563 0.386279 0.106374 0.038210 0.035491 0.348601 0.497932 0.574355 0.016940 0.027414 0.390911
266 Industrials 6 0.145846 0.661201 0.084651 0.013100 0.112735 0.355814 0.504984 0.612095 0.011892 0.046640 0.291951
267 Energy 3 0.015829 0.215530 0.774610 0.027293 0.000000 0.350557 0.475595 0.511367 0.022618 0.171688 0.373539
268 Industrials 6 0.031083 0.527255 0.027726 0.009825 0.002088 0.349218 0.506106 0.569144 0.052637 0.033556 0.359146
269 Utilities 11 0.044071 0.532019 0.138574 0.014192 0.009395 0.351777 0.506025 0.596819 0.018794 0.016478 0.351405
270 Financials 4 0.022378 0.612563 0.188147 0.010917 0.103340 0.369792 0.520638 0.559260 0.222400 0.054709 0.370346
271 Energy 3 0.015302 0.364365 0.337318 0.002183 0.006263 0.350494 0.494559 0.552520 0.105713 0.151620 0.358404
272 Materials 8 0.031564 0.410867 0.220167 0.045852 0.019833 0.352840 0.495172 0.564561 0.018039 0.046518 0.357739
273 Materials 8 0.200795 0.623085 0.180230 0.131004 0.010438 0.355718 0.512442 0.652170 0.010590 0.037853 0.384750
274 Real Estate 9 0.085387 0.500425 0.093271 0.003275 0.049061 0.349743 0.496395 0.559080 0.040910 0.205349 0.355923
275 Consumer Discretionary 1 0.039915 0.581006 0.270532 0.042576 0.024008 0.330079 0.503123 0.591967 0.016634 0.016876 0.332178
276 Utilities 11 0.033287 0.504023 0.042085 0.010917 0.015658 0.372263 0.540942 0.573277 0.147348 0.028681 0.357356
277 Real Estate 9 0.149506 0.512956 0.104600 0.051310 0.049061 0.353336 0.535071 0.602750 0.054825 0.057384 0.364794
278 Financials 4 0.074053 0.598682 0.090384 0.649563 0.053236 0.318779 0.514572 0.588193 0.039743 0.038479 0.350613
279 Industrials 6 0.091385 0.325141 0.122347 0.009825 0.009395 0.351601 0.496040 0.577051 0.009908 0.070459 0.279042
280 Utilities 11 0.070455 0.433913 0.102235 0.010917 0.009395 0.345333 0.518616 0.598706 0.036034 0.027381 0.329324
281 Financials 4 0.030178 0.578387 0.183207 0.007642 0.103340 0.268297 0.530769 0.582442 0.082573 0.016947 0.358887
282 Financials 4 0.048691 0.452756 0.184950 0.008734 0.122129 0.330289 0.531749 0.590619 0.066771 0.022307 0.351266
283 Information Technology 7 0.056933 0.377914 0.333837 0.026201 0.234864 0.357994 0.507115 0.587744 0.026412 0.029164 0.407113
284 Energy 3 0.002054 0.022818 1.000000 0.217249 0.002088 0.349368 0.395497 0.495372 0.061072 0.171688 0.377187
285 Financials 4 0.020394 0.433105 0.286431 0.018559 0.103340 0.366102 0.536627 0.573816 0.131231 0.016181 0.366876
286 Health Care 5 0.069613 0.445080 0.105280 0.017467 0.121086 0.400100 0.520471 0.584239 0.056923 0.040743 0.405227
287 Telecommunications Services 10 0.023543 0.519388 0.032826 0.010917 0.011482 0.241649 0.768668 0.571210 0.913809 0.022060 0.256267
288 Consumer Staples 2 0.070385 0.589726 0.125983 0.004367 0.036534 0.345856 0.497967 0.567347 0.025709 0.086606 0.247261
289 Information Technology 7 0.017254 0.374785 0.519251 0.026201 0.091858 0.350713 0.486012 0.536167 0.018298 0.136392 0.390807
290 Consumer Discretionary 1 0.016545 0.594569 0.276615 0.021834 0.021921 0.350892 0.500053 0.568245 0.032224 0.018233 0.308955
291 Financials 4 0.041450 0.472668 0.075335 0.013100 0.103340 0.350412 0.501461 0.587744 0.015906 0.020266 0.361799
292 Health Care 5 0.108111 0.613976 0.133768 0.008734 0.011482 0.322676 0.531653 0.594483 0.060440 0.048873 0.234359
293 Consumer Discretionary 1 0.063560 0.801845 0.219706 0.014192 0.221294 0.355530 0.494601 0.562315 0.018887 0.112054 0.383786
294 Financials 4 0.085293 0.588750 0.058801 0.015284 0.103340 0.350744 0.562164 0.648666 0.046521 0.013967 0.366635
295 Consumer Discretionary 1 0.063757 0.473991 0.181432 0.030568 0.010438 0.350953 0.499028 0.577141 0.017576 0.048147 0.400276
296 Consumer Staples 2 0.038435 0.688766 0.221883 0.013100 0.019833 0.358376 0.515906 0.579387 0.027980 0.040249 0.361328
297 Energy 3 0.079397 0.545244 0.291397 0.031659 0.038622 0.348743 0.522577 0.662234 0.015580 0.010463 0.393371
298 Information Technology 7 0.035657 0.551609 0.219941 0.020742 0.120042 0.353689 0.498166 0.567706 0.025884 0.042308 0.359540
299 Information Technology 7 0.039600 0.558826 0.137856 0.031659 0.131524 0.344333 0.552720 0.575613 0.165761 0.030906 0.384471
300 Consumer Discretionary 1 0.059908 0.295371 0.266622 0.014192 0.028184 0.336065 0.495322 0.577141 0.030323 0.034059 0.366802
301 Industrials 6 0.041560 0.541635 0.263705 0.088428 0.043841 0.381959 0.643486 0.725312 0.056812 0.000000 0.257934
302 Real Estate 9 0.026030 0.545241 0.110412 0.012009 0.049061 0.350291 0.497569 0.561596 0.038189 0.049447 0.355923
303 Health Care 5 0.090511 0.410967 0.341974 0.016376 0.006263 0.351469 0.504660 0.611825 0.011595 0.027437 0.401469
304 Health Care 5 0.089055 0.475585 0.194752 0.017467 0.073069 0.457776 0.611653 0.604726 0.150903 0.031136 0.328066
305 Financials 4 0.022661 0.498465 0.096100 0.009825 0.122129 0.350882 0.508549 0.581454 0.035776 0.012472 0.350613
306 Industrials 6 0.058011 0.340165 0.181221 0.024017 0.044885 0.344458 0.589952 0.599425 0.136732 0.021438 0.376180
307 Industrials 6 0.072203 0.433853 0.024239 0.212882 0.045929 0.364288 0.591453 0.598257 0.142328 0.028473 0.376193
308 Industrials 6 0.072077 0.540166 0.056210 0.029476 0.032359 0.401539 0.649072 0.628268 0.137778 0.015391 0.306657
309 Health Care 5 0.060057 0.551744 0.078422 0.025109 0.063674 0.350438 0.499051 0.587025 0.011737 0.031667 0.417141
310 Energy 3 0.052115 0.630949 0.232337 0.019651 0.058455 0.363850 0.573650 0.621799 0.076828 0.011242 0.445363
311 Materials 8 0.071211 0.520237 0.289208 0.004367 0.084551 0.355023 0.495084 0.564831 0.017217 0.103361 0.369683
312 Real Estate 9 0.075139 0.559370 0.074492 0.010917 0.049061 0.370488 0.506325 0.582352 0.029841 0.047142 0.365707
313 Industrials 6 0.056972 0.447059 0.187387 0.039301 0.010438 0.353653 0.501054 0.577500 0.022451 0.042100 0.304613
314 Information Technology 7 0.065221 0.690823 0.168011 0.037118 0.132568 0.351716 0.498296 0.579477 0.014088 0.044977 0.390848
315 Health Care 5 0.095502 0.675837 0.447992 0.063319 0.230689 0.353356 0.478876 0.529158 0.034765 0.069829 0.383455
316 Real Estate 9 0.040875 0.463321 0.185023 0.003275 0.049061 0.350500 0.499212 0.561236 0.049749 0.079699 0.351287
317 Telecommunications Services 10 0.032839 0.522669 0.028446 0.117904 0.014614 0.158889 0.863185 0.589271 0.661210 0.014506 0.242409
318 Health Care 5 0.102389 0.597516 0.080962 0.024017 0.443633 0.352605 0.500251 0.601132 0.008908 0.039373 0.514984
319 Utilities 11 0.036845 0.441794 0.096148 0.006550 0.002088 0.350178 0.503821 0.571120 0.039735 0.035814 0.361959
320 Financials 4 0.039246 0.515384 0.061507 0.012009 0.103340 0.336169 0.967730 0.587474 0.888731 0.019176 0.366409
321 Consumer Discretionary 1 0.112063 0.458975 0.432725 0.017467 0.010438 0.342612 0.506796 0.639321 0.008321 0.022520 0.151244
322 Industrials 6 0.038467 0.530340 0.053863 0.014192 0.002088 0.310897 0.506171 0.564831 0.069467 0.055637 0.364082
323 Energy 3 0.016687 0.157969 0.776317 0.008734 0.004175 0.346178 0.478570 0.543086 0.118018 0.171688 0.300014
324 Information Technology 7 0.010555 0.435692 0.140344 0.064410 0.016701 0.335941 0.507938 0.564561 0.079313 0.015335 0.331777
325 Real Estate 9 0.020056 0.544857 0.157245 0.009825 0.121086 0.332791 0.501021 0.557912 0.088210 0.058560 0.382116
326 Consumer Discretionary 1 0.053642 0.471032 0.155647 0.068777 0.009395 0.350181 0.503231 0.596460 0.014755 0.021119 0.321025
327 Consumer Discretionary 1 0.050919 0.749904 0.795871 0.188865 0.206681 0.347364 0.494544 0.567257 0.011989 0.062682 0.432855
328 Energy 3 0.066811 0.320277 0.432761 0.092795 0.198330 0.362239 0.440255 0.317010 0.010644 0.171688 0.406003
329 Utilities 11 0.024724 0.474774 0.073277 0.008734 0.002088 0.350724 0.510996 0.567347 0.078249 0.029661 0.359956
330 Financials 4 0.027297 0.536558 0.067028 0.009825 0.103340 0.373528 0.515521 0.587834 0.041923 0.012091 0.333147
331 Energy 3 0.057814 0.497024 0.165562 0.008734 0.007307 0.322063 0.827142 0.584509 0.679614 0.032967 0.357790
332 Health Care 5 0.044354 0.656002 0.071244 0.010917 0.062630 0.354717 0.495710 0.565999 0.018374 0.059148 0.375148
333 Information Technology 7 0.004825 0.553961 0.294659 0.004367 0.027140 0.349212 0.500354 0.553689 0.179545 0.042609 0.369538
334 Industrials 6 0.025188 0.568983 0.112597 0.016376 0.086639 0.351088 0.497561 0.566807 0.024982 0.031383 0.391111
335 Information Technology 7 0.022638 0.606935 0.289062 0.015284 0.479123 0.318273 0.399602 0.508222 0.148702 0.049592 0.401498
336 Consumer Discretionary 1 0.037795 0.376104 0.193849 0.153930 0.028184 0.355530 0.517428 0.576602 0.066488 0.028084 0.352274
337 Health Care 5 0.077209 0.552717 0.174438 0.000000 0.104384 0.362317 0.493538 0.556923 0.026223 0.244885 0.254575
338 Financials 4 0.017946 0.449897 0.191067 0.003275 0.103340 0.349192 0.496925 0.560697 0.037546 0.037735 0.370673
339 Health Care 5 0.034177 0.624463 0.228009 0.033843 0.067850 0.359064 0.497540 0.556025 0.076792 0.128613 0.379378
In [ ]:
# Plot new results for each sector, using only the top variables and our dictionary from earlier

#normalized_numerical_data_df is the scaled dataframe we created earlier as part of final_normalized_df_eda

cols_to_include = ['GICS Sector Int', 'Net Income', 'Estimated Shares Outstanding',
                   'Earnings Per Share', 'Current Price', 'P/E Ratio',
                   'P/B Ratio', 'Volatility', 'ROE']

# Create an empty DataFrame to store correlations
all_correlations = []

# Plot new results by sector
for friendly_name, rank in rank_mapping.items():
    # Filter the DataFrame for the current sector
    df_temp_corr = final_normalized_df_eda[cols_to_include]
    df_temp_corr = df_temp_corr[df_temp_corr['GICS Sector Int'] == rank]

    # Calculate the correlation matrix
    corr_matrix = df_temp_corr.corr()

    # Store the correlation matrix in a list with sector information
    corr_df = corr_matrix.stack().reset_index()
    corr_df.columns = ['Variable1', 'Variable2', 'Correlation']
    corr_df['Sector'] = friendly_name  # Add the sector name

    # Append to the list
    all_correlations.append(corr_df)

    # Plot the heatmap
    plt.figure(figsize=(15, 7))
    sns.heatmap(corr_matrix, annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
    plt.title(f'{friendly_name}: {rank}')
    plt.show()

# Concatenate all correlation DataFrames into one
all_correlations_df = pd.concat(all_correlations, ignore_index=True)

# Filter for correlations greater than 0.75 (and not equal to 1)
high_corrs = all_correlations_df[(all_correlations_df['Correlation'] > 0.75) & (all_correlations_df['Variable1'] != all_correlations_df['Variable2'])]

ANSWER

In our initial correlation observations earlier, we visualized the correlation of the variables in both their native and scaled form. The scaled version normalizes the values of each feature and gives us a better picture of correlations.

Summary of prior results, for a correlation scale of 0 to 1:

  • Only a few of the variables show significant correlations (above .25 or below -.25).

  • For the scaled data, it's worth nothing that Net Income is present in both the top 2 positive and negative correlations. We can infer that Net Income has a major effect on other aspects of a stock ticker.

  • These series all show a positive correlation of greater than .25:

    • Net Income and Estimated Shares Outstanding (.59)
    • Net Income and Earnings per Share (.56)
    • Earnings per Share and Current Price (.48)
    • P/E Ratio and Current Price (.26)
    • P/E Ratio and Volatility (.26)
  • These series all show a negative correlation of less than -.25:

    • Price Change and Volatility (-.41)
    • Net Income and Volatility (-.38)
    • Net Income and ROE (-.29)
    • Earnings per Share and Volatility (-.38)
    • Earnings per Share and P/E Ratio (-.26)

When we examine correlations by sector using our scaled data, we can observe some interesting information.

  • Ten different correlations exist that are .75 or higher.
  • The highest correlation exists between P/B Ratio and Volatility, for the Telecommunications Services sector.
  • Telecommunications Services claims the top three correlations out of all features and sectors.
  • Consumer Discretionary and Financials have two apiece of the top 10 correlations.
  • Net Income and Estimated Shares Outstanding surfaced as a top correlation for four different sectors.

project_7_corrheatmap_scaled.png

Q4: How does the average cash ratio vary across economic sectors?¶

Cash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents.

In [ ]:
# Create pivot table for cash ratio by sector
df_eda_sector_cash_ratio_pivot = df_eda.pivot_table(
    index='GICS Sector',
    values='Cash Ratio',
    aggfunc='mean'
)

# Sort the DataFrame by mean in descending order
df_eda_sector_cash_ratio_pivot.sort_values(by='Cash Ratio', ascending=False, inplace=True)

# Display the combined DataFrame
display(df_eda_sector_cash_ratio_pivot)
Cash Ratio
GICS Sector
Information Technology 149.818182
Telecommunications Services 117.000000
Health Care 103.775000
Financials 98.591837
Consumer Staples 70.947368
Energy 51.133333
Real Estate 50.111111
Consumer Discretionary 49.575000
Materials 41.700000
Industrials 36.188679
Utilities 13.625000
In [ ]:
# Visualize cash ratio
plt.figure(figsize=(10, 5))
sns.barplot(data=df_eda.sort_values(by='Cash Ratio', ascending=False), x='GICS Sector', y='Cash Ratio', hue='GICS Sector')
plt.xticks(rotation=90)
plt.show()

ANSWER

As a whole, the average Cash Ratio is 70.02, with a median of 47.0. We can further examine by sector.

  • The top five sectors all have an average cash ratio above the average Cash Ratio:

    • Information Technology: 149.82
    • Telecommunications Services: 117.00
    • Health Care: 103.78
    • Financials 98.59
    • Consumer Staples 70.95
  • The bottom three sectors for average cash ratio are:

    • Materials: 41.70
    • Industrials: 36.19
    • Utilities: 13.63
  • The widest ranges of Cash Ratios exist in the Telecommunications Services and Information Technology sectors. The highest levels of Cash Ratios exist in these sectors as well, and companies within this sector have a strong liquidity position, meaning they can easily cover their short-term obligations.

  • Conversely, Utilities, Industrials and Real Estate have the tightest Cash Ratio ranges. Liquidity is low in these industries.

Q5: How does the P/E ratio vary, on average, across economic sectors?¶

P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings.

In [ ]:
# Create pivot table for P/E Ratio by sector
df_eda_sector_pe_ratio_pivot = df_eda.pivot_table(
    index='GICS Sector',
    values='P/E Ratio',
    aggfunc='mean'
)

# Sort the DataFrame by mean in descending order
df_eda_sector_pe_ratio_pivot.sort_values(by='P/E Ratio', ascending=False, inplace=True)

# Display the combined DataFrame
display(df_eda_sector_pe_ratio_pivot)
P/E Ratio
GICS Sector
Energy 72.897709
Information Technology 43.782546
Real Estate 43.065585
Health Care 41.135272
Consumer Discretionary 35.211613
Consumer Staples 25.521195
Materials 24.585352
Utilities 18.719412
Industrials 18.259380
Financials 16.023151
Telecommunications Services 12.222578
In [ ]:
# Visualize P/E ratio
plt.figure(figsize=(10, 5))
sns.barplot(data=df_eda.sort_values(by='P/E Ratio', ascending=False), x='GICS Sector', y='P/E Ratio', hue='GICS Sector')
plt.xticks(rotation=90)
plt.show()

ANSWER

The average P/E Ratio across all sectors is 32.61, with a median of 20.82. Again, we can further examine by sector.

Five sectors all have an average P/E ratio above the average, and companies are holding onto cash longer. The market is likely bullish on these industries and expect future growth and/or profitablity.

  • Energy: 72.90
  • Information Technology: 43.78
  • Real Estate: 43.07
  • Health Care: 41.14
  • Consumer Discretionary: 35.21

The bottom three sectors for average P/E ratio are as follows. These sectors are likely not expected to grow much or generate much profit, and do not maintain large cash reserves.

  • Industrials: 18.26
  • Financials: 16.02
  • Telecommunications Services: 12.22
  • The widest range of P/E Ratios exist in the Information Technology and Consumer Discretionary sectors.

  • The tightest range of P/E ratios exist in the Industrials, Financials, Utilities and Telecommunications Services sectors.

Unsupervised Learning Analysis¶

The goal of our data analysis is to group stocks into similar clusters based on the data attributes we have in our dataset, so that we can define the characteristics of each group.

To do this, we need to:

  • Understand the data through univariate and bivariate analysis (completed earlier)
  • Pre-process the data to handle missing values and normalize/scale the data
  • Decide what type of clustering solution to use
    • Use K-means clustering
    • Use Hierarchical clustering
    • Use Principal Component Analysis
  • Select the algorithm we want to use and fit our final model
  • Interpret the results

Helpers¶

In [ ]:
# Import additional libraries
from scipy.spatial.distance import cdist

# Library to compute distances
from scipy.spatial.distance import cdist, pdist

# Library to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Library for principal component analysis
from sklearn.decomposition import PCA

# Library for scaling; RobustScaler is supposed to work better for datasets with outliers
from sklearn.preprocessing import RobustScaler

# Libraries for visualization
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
from sklearn.manifold import TSNE

# Library to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet

Data Preprocessing¶

  • Duplicate value check
  • Missing value treatment
  • Outlier check
  • Feature engineering (if needed)
  • Any other preprocessing steps (if needed)
In [ ]:
# Copy the original dataframe before we begin
df_model = df.copy()

Value Check¶

We know from earlier that we are not missing any values, but let's double check.

In [ ]:
# Check data types
df_model.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Ticker Symbol                 340 non-null    object 
 1   Security                      340 non-null    object 
 2   GICS Sector                   340 non-null    object 
 3   GICS Sub Industry             340 non-null    object 
 4   Current Price                 340 non-null    float64
 5   Price Change                  340 non-null    float64
 6   Volatility                    340 non-null    float64
 7   ROE                           340 non-null    int64  
 8   Cash Ratio                    340 non-null    int64  
 9   Net Cash Flow                 340 non-null    int64  
 10  Net Income                    340 non-null    int64  
 11  Earnings Per Share            340 non-null    float64
 12  Estimated Shares Outstanding  340 non-null    float64
 13  P/E Ratio                     340 non-null    float64
 14  P/B Ratio                     340 non-null    float64
dtypes: float64(7), int64(4), object(4)
memory usage: 40.0+ KB
In [ ]:
# Describe the data. Can also use include=all to also show categorical series.
df_model.describe(include='all').T
Out[ ]:
count unique top freq mean std min 25% 50% 75% max
Ticker Symbol 340 340 AAL 1 NaN NaN NaN NaN NaN NaN NaN
Security 340 340 American Airlines Group 1 NaN NaN NaN NaN NaN NaN NaN
GICS Sector 340 11 Industrials 53 NaN NaN NaN NaN NaN NaN NaN
GICS Sub Industry 340 104 Oil & Gas Exploration & Production 16 NaN NaN NaN NaN NaN NaN NaN
Current Price 340.0 NaN NaN NaN 80.862345 98.055086 4.5 38.555 59.705 92.880001 1274.949951
Price Change 340.0 NaN NaN NaN 4.078194 12.006338 -47.129693 -0.939484 4.819505 10.695493 55.051683
Volatility 340.0 NaN NaN NaN 1.525976 0.591798 0.733163 1.134878 1.385593 1.695549 4.580042
ROE 340.0 NaN NaN NaN 39.597059 96.547538 1.0 9.75 15.0 27.0 917.0
Cash Ratio 340.0 NaN NaN NaN 70.023529 90.421331 0.0 18.0 47.0 99.0 958.0
Net Cash Flow 340.0 NaN NaN NaN 55537620.588235 1946365312.175789 -11208000000.0 -193906500.0 2098000.0 169810750.0 20764000000.0
Net Income 340.0 NaN NaN NaN 1494384602.941176 3940150279.327937 -23528000000.0 352301250.0 707336000.0 1899000000.0 24442000000.0
Earnings Per Share 340.0 NaN NaN NaN 2.776662 6.587779 -61.2 1.5575 2.895 4.62 50.09
Estimated Shares Outstanding 340.0 NaN NaN NaN 577028337.75403 845849595.417695 27672156.86 158848216.1 309675137.8 573117457.325 6159292035.0
P/E Ratio 340.0 NaN NaN NaN 32.612563 44.348731 2.935451 15.044653 20.819876 31.764755 528.039074
P/B Ratio 340.0 NaN NaN NaN -1.718249 13.966912 -76.119077 -4.352056 -1.06717 3.917066 129.064585

Outlier Treatment¶

Our dataset is very small (340 rows of data). Below, we can observe quite a few outliers. However, eliminating these would remove a significant amount of data. In addition, the top outlier in each area isn't consistently a single security, so we know that the most extreme outliers are not all the same security.

For these reasons, we will not remove any outliers. However, since we can observe a high number of outliers, we should carefully consider which clustering algorithms we select.

In [ ]:
# Check each feature to see if we have any single ticker/row as an outlier in all features

features = df_model.columns.difference(['Security'])

# Loop through each feature and print the Ticker Name with the maximum value
for feature in features:
    max_value_row = df_model.loc[df_model[feature].idxmax()]
    min_value_row = df_model.loc[df_model[feature].idxmin()]
    print(f'**{feature}**')
    print(f'  Max:  {max_value_row["Security"]} ({max_value_row[feature]})')
    print(f'  Min:  {min_value_row["Security"]} ({min_value_row[feature]})')
**Cash Ratio**
  Max:  Facebook (958)
  Min:  Charter Communications (0)
**Current Price**
  Max:  Priceline.com Inc (1274.949951)
  Min:  Chesapeake Energy (4.5)
**Earnings Per Share**
  Max:  Priceline.com Inc (50.09)
  Min:  Apache Corporation (-61.2)
**Estimated Shares Outstanding**
  Max:  Pfizer Inc. (6159292035.0)
  Min:  Mettler Toledo (27672156.86)
**GICS Sector**
  Max:  Ameren Corp (Utilities)
  Min:  Amazon.com Inc (Consumer Discretionary)
**GICS Sub Industry**
  Max:  American Water Works Company Inc (Water Utilities)
  Min:  Interpublic Group (Advertising)
**Net Cash Flow**
  Max:  Bank of America Corp (20764000000)
  Min:  Citigroup Inc. (-11208000000)
**Net Income**
  Max:  JPMorgan Chase & Co. (24442000000)
  Min:  Apache Corporation (-23528000000)
**P/B Ratio**
  Max:  Alliance Data Systems (129.0645854)
  Min:  Charter Communications (-76.11907749)
**P/E Ratio**
  Max:  Amazon.com Inc (528.0390742)
  Min:  United Continental Holdings (2.935450768)
**Price Change**
  Max:  First Solar Inc (55.05168339)
  Min:  Kinder Morgan (-47.12969338)
**ROE**
  Max:  Apache Corporation (917)
  Min:  Concho Resources (1)
**Ticker Symbol**
  Max:  Zoetis (ZTS)
  Min:  American Airlines Group (AAL)
**Volatility**
  Max:  Southwestern Energy (4.58004173)
  Min:  McDonald's Corp. (0.733163184)
In [ ]:
# Create tight boxplots to show outliers
plt.figure(figsize=(12, 10))

num_cols = df_model.select_dtypes(include=np.number).columns.tolist()

for i, col in enumerate(num_cols):
    plt.subplot(3, 4, i + 1)
    plt.boxplot(df_model[col], whis=1.5)
    plt.tight_layout()
    plt.title(col)

plt.show()
In [ ]:
df_model.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Ticker Symbol                 340 non-null    object 
 1   Security                      340 non-null    object 
 2   GICS Sector                   340 non-null    object 
 3   GICS Sub Industry             340 non-null    object 
 4   Current Price                 340 non-null    float64
 5   Price Change                  340 non-null    float64
 6   Volatility                    340 non-null    float64
 7   ROE                           340 non-null    int64  
 8   Cash Ratio                    340 non-null    int64  
 9   Net Cash Flow                 340 non-null    int64  
 10  Net Income                    340 non-null    int64  
 11  Earnings Per Share            340 non-null    float64
 12  Estimated Shares Outstanding  340 non-null    float64
 13  P/E Ratio                     340 non-null    float64
 14  P/B Ratio                     340 non-null    float64
dtypes: float64(7), int64(4), object(4)
memory usage: 40.0+ KB
In [ ]:
# One-hot encode variables for Sector and Sub Sector: Determined we don't want to do this
#df_model_clean = pd.get_dummies(df_model_clean, columns=['GICS Sector'])
#df_model_clean = pd.get_dummies(df_model_clean, columns=['GICS Sub Industry'])
In [ ]:
# Drop unique values: Determined we don't want to do this
#df_model_clean = df_model_clean.drop(['Ticker Symbol', 'Security'], axis=1)
In [ ]:
df_model.describe(include='all').T
Out[ ]:
count unique top freq mean std min 25% 50% 75% max
Ticker Symbol 340 340 AAL 1 NaN NaN NaN NaN NaN NaN NaN
Security 340 340 American Airlines Group 1 NaN NaN NaN NaN NaN NaN NaN
GICS Sector 340 11 Industrials 53 NaN NaN NaN NaN NaN NaN NaN
GICS Sub Industry 340 104 Oil & Gas Exploration & Production 16 NaN NaN NaN NaN NaN NaN NaN
Current Price 340.0 NaN NaN NaN 80.862345 98.055086 4.5 38.555 59.705 92.880001 1274.949951
Price Change 340.0 NaN NaN NaN 4.078194 12.006338 -47.129693 -0.939484 4.819505 10.695493 55.051683
Volatility 340.0 NaN NaN NaN 1.525976 0.591798 0.733163 1.134878 1.385593 1.695549 4.580042
ROE 340.0 NaN NaN NaN 39.597059 96.547538 1.0 9.75 15.0 27.0 917.0
Cash Ratio 340.0 NaN NaN NaN 70.023529 90.421331 0.0 18.0 47.0 99.0 958.0
Net Cash Flow 340.0 NaN NaN NaN 55537620.588235 1946365312.175789 -11208000000.0 -193906500.0 2098000.0 169810750.0 20764000000.0
Net Income 340.0 NaN NaN NaN 1494384602.941176 3940150279.327937 -23528000000.0 352301250.0 707336000.0 1899000000.0 24442000000.0
Earnings Per Share 340.0 NaN NaN NaN 2.776662 6.587779 -61.2 1.5575 2.895 4.62 50.09
Estimated Shares Outstanding 340.0 NaN NaN NaN 577028337.75403 845849595.417695 27672156.86 158848216.1 309675137.8 573117457.325 6159292035.0
P/E Ratio 340.0 NaN NaN NaN 32.612563 44.348731 2.935451 15.044653 20.819876 31.764755 528.039074
P/B Ratio 340.0 NaN NaN NaN -1.718249 13.966912 -76.119077 -4.352056 -1.06717 3.917066 129.064585

Scaling Treatment¶

We observed earlier that we have a high outlier count, and also major discrepancies of scale in our features. When we examined correlations, we normalized the data. We will need to similar do this to our base model data.

We can choose between several different types of scaling - MinMaxScaler, Z-score, or Robust. Since Robust works well with datasets containing outliers, we will use it.

In [ ]:
# Initialize the RobustScaler
scaler = RobustScaler()

subset = df_model.select_dtypes(include=['number']).copy()

# Scale the data
subset_scaled = scaler.fit_transform(subset)

# Convert to DataFrame
df_model_scaled = pd.DataFrame(subset_scaled, columns=subset.columns)

Cluster Algorithm Selection¶

Next, let's see which clustering technique works best for our scaled dataset. To do this, we will first run PCA, and then use two clustering techniques.

  • Principal Component Analysis (reduce dimensionality)
  • K-Means Clustering (find ideal number of clusters)
  • Hierarchical Clustering

K-means Clustering¶

First, we will attempt to visualize various k values to determine which one we want to use moving forward. To do this we will leverage the elbow method and visualize the distortion for each k value.

In [ ]:
# Measure execution time
%%time

# Identify desired cluster count
df_k_means = df_model_scaled.copy()

clusters = range(1, 16)
meanDistortions = []

# Loop through cluster values
for k in clusters:
    model_k_means = KMeans(n_clusters=k)
    model_k_means.fit(df_k_means)
    prediction = model_k_means.predict(df_k_means)
    distortion = (
        sum(
            np.min(cdist(df_k_means, model_k_means.cluster_centers_, 'euclidean'), axis=1)
        )
        / df_k_means.shape[0]
    )

    meanDistortions.append(distortion)

    print('# of Clusters:', k, '\tAverage Distortion:', distortion)

plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Avg Distortion')
plt.title('Selecting k with Elbow Method')
plt.show()
# of Clusters: 1 	Average Distortion: 6.165015784179755
# of Clusters: 2 	Average Distortion: 5.519913922135517
# of Clusters: 3 	Average Distortion: 5.301175199731372
# of Clusters: 4 	Average Distortion: 5.128821925696185
# of Clusters: 5 	Average Distortion: 4.852367478608126
# of Clusters: 6 	Average Distortion: 4.564025418031926
# of Clusters: 7 	Average Distortion: 4.454884657269399
# of Clusters: 8 	Average Distortion: 4.221774350762034
# of Clusters: 9 	Average Distortion: 4.191296800846647
# of Clusters: 10 	Average Distortion: 4.079244586736443
# of Clusters: 11 	Average Distortion: 3.829791083252664
# of Clusters: 12 	Average Distortion: 3.753949415523762
# of Clusters: 13 	Average Distortion: 3.6369603635468146
# of Clusters: 14 	Average Distortion: 3.596860202066002
# of Clusters: 15 	Average Distortion: 3.445799561982341
CPU times: user 409 ms, sys: 112 ms, total: 521 ms
Wall time: 488 ms
In [ ]:
# Fit the model to the visualizer
model_k_means_2 = KMeans(random_state=1)
visualizer = KElbowVisualizer(model_k_means_2, k=(1, 16), timings=True)
visualizer.fit(df_k_means)
visualizer.show()
Out[ ]:
<Axes: title={'center': 'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>
In [ ]:
# Check silhouette scores
sil_score = []

cluster_list = range(2, 16)

for n_clusters in cluster_list:
    clusterer = KMeans(n_clusters=n_clusters, random_state=1)
    preds = clusterer.fit_predict((df_k_means))
    score = silhouette_score(df_k_means, preds)
    sil_score.append(score)
    print('n_clusters={}: Silhouette score={})'.format(n_clusters, score))

plt.plot(cluster_list, sil_score)
plt.show()
n_clusters=2: Silhouette score=0.7757324276239033)
n_clusters=3: Silhouette score=0.7454764840420494)
n_clusters=4: Silhouette score=0.4621955535259713)
n_clusters=5: Silhouette score=0.4829025199901958)
n_clusters=6: Silhouette score=0.4813819928027789)
n_clusters=7: Silhouette score=0.49576783422468745)
n_clusters=8: Silhouette score=0.411864134740954)
n_clusters=9: Silhouette score=0.4012633404812339)
n_clusters=10: Silhouette score=0.3763626634759577)
n_clusters=11: Silhouette score=0.3505016857108297)
n_clusters=12: Silhouette score=0.3613349297613386)
n_clusters=13: Silhouette score=0.3806084017704855)
n_clusters=14: Silhouette score=0.3183846743555661)
n_clusters=15: Silhouette score=0.27805939414564607)
In [ ]:
# Visualize the silhouette score for various values between 5 and 8

for i in range(5, 9):
    visualizer = SilhouetteVisualizer(KMeans(i, random_state=1))
    visualizer.fit(df_k_means)
    visualizer.show()
In [ ]:
# Run again for the ideal cluster count for k-means
kmeans = KMeans(n_clusters=5, random_state=1)
kmeans.fit(df_model_scaled)
Out[ ]:
KMeans(n_clusters=5, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
KMeans(n_clusters=5, random_state=1)
In [ ]:
# Apply the k-means to a final model
df_kmean = df_model.copy()

# Add labels
df_kmean['Cluster'] = kmeans.labels_
df_model_scaled['Cluster'] = kmeans.labels_
Cluster Profile¶
In [ ]:
# Show counts by cluster
display(df_kmean['Cluster'].value_counts())
count
Cluster
3 303
1 16
0 12
2 7
4 2

In [ ]:
# Show the securities in each cluster
for cluster in df_kmean['Cluster'].unique():
    cluster_securities = df_kmean[df_kmean['Cluster'] == cluster]['Security']
    print(f'Cluster {cluster}: {", ".join(cluster_securities)}')
Cluster 1: American Airlines Group, AbbVie, Devon Energy Corp., Freeport-McMoran Cp & Gld, IDEXX Laboratories, Masco Corp., Moody's Corp, Altria Group Inc, Newfield Exploration Co, Pitney-Bowes, Sherwin-Williams, Southwestern Energy, United Parcel Service, Wynn Resorts Ltd, Cimarex Energy, Yum! Brands Inc
Cluster 3: Abbott Laboratories, Adobe Systems Inc, Analog Devices, Inc., Archer-Daniels-Midland Co, Alliance Data Systems, Ameren Corp, American Electric Power, AFLAC Inc, American International Group, Inc., Apartment Investment & Mgmt, Assurant Inc, Arthur J. Gallagher & Co., Akamai Technologies Inc, Albemarle Corp, Alaska Air Group Inc, Allstate Corp, Alexion Pharmaceuticals, Applied Materials Inc, AMETEK Inc, Affiliated Managers Group Inc, Amgen Inc, Ameriprise Financial, American Tower Corp A, Amazon.com Inc, AutoNation Inc, Anthem Inc., Aon plc, Amphenol Corp, Arconic Inc, Activision Blizzard, AvalonBay Communities, Inc., Broadcom, American Water Works Company Inc, American Express Co, Boeing Company, Baxter International Inc., BB&T Corporation, Bard (C.R.) Inc., Baker Hughes Inc, BIOGEN IDEC Inc., The Bank of New York Mellon Corp., Ball Corp, Boston Scientific, BorgWarner, Boston Properties, Caterpillar Inc., Chubb Limited, CBRE Group, Crown Castle International Corp., Carnival Corp., Celgene Corp., CF Industries Holdings Inc, Citizens Financial Group, Church & Dwight, C. H. Robinson Worldwide, CIGNA Corp., Cincinnati Financial, Comerica Inc., CME Group Inc., Chipotle Mexican Grill, Cummins Inc., CMS Energy, Centene Corporation, CenterPoint Energy, Capital One Financial, Cabot Oil & Gas, The Cooper Companies, CSX Corp., CenturyLink Inc, Cognizant Technology Solutions, Citrix Systems, CVS Health, Chevron Corp., Concho Resources, Dominion Resources, Delta Air Lines, Du Pont (E.I.), Deere & Co., Discover Financial Services, Quest Diagnostics, Danaher Corp., The Walt Disney Company, Discovery Communications-A, Discovery Communications-C, Delphi Automotive, Digital Realty Trust, Dun & Bradstreet, Dover Corp., Dr Pepper Snapple Group, Duke Energy, DaVita Inc., Ecolab Inc., Consolidated Edison, Equifax Inc., Edison Int'l, Eastman Chemical, EOG Resources, Equinix, Equity Residential, EQT Corporation, Eversource Energy, Essex Property Trust, Inc., E*Trade, Eaton Corporation, Entergy Corp., Edwards Lifesciences, Exelon Corp., Expeditors Int'l, Expedia Inc., Extra Space Storage, Ford Motor, Fastenal Co, Facebook, Fortune Brands Home & Security, FirstEnergy Corp, Fidelity National Information Services, Fiserv Inc, FLIR Systems, Fluor Corp., Flowserve Corporation, FMC Corporation, Federal Realty Investment Trust, First Solar Inc, Frontier Communications, General Dynamics, General Growth Properties Inc., Gilead Sciences, Corning Inc., Genuine Parts, Garmin Ltd., Goodyear Tire & Rubber, Grainger (W.W.) Inc., Halliburton Co., Hasbro Inc., Huntington Bancshares, HCA Holdings, Welltower Inc., HCP Inc., Hess Corporation, Hartford Financial Svc.Gp., Harley-Davidson, Honeywell Int'l Inc., Hewlett Packard Enterprise, HP Inc., Hormel Foods Corp., Henry Schein, Host Hotels & Resorts, The Hershey Company, Humana Inc., International Business Machines, Intl Flavors & Fragrances, International Paper, Interpublic Group, Iron Mountain Incorporated, Intuitive Surgical Inc., Illinois Tool Works, Invesco Ltd., J. B. Hunt Transport Services, Jacobs Engineering Group, Juniper Networks, Kimco Realty, Kinder Morgan, Kansas City Southern, Leggett & Platt, Lennar Corp., Laboratory Corp. of America Holding, LKQ Corporation, L-3 Communications Holdings, Lilly (Eli) & Co., Lockheed Martin Corp., Alliant Energy Corp, Leucadia National Corp., Southwest Airlines, Level 3 Communications, LyondellBasell, Mastercard Inc., Mid-America Apartments, Macerich, Marriott Int'l., Mattel Inc., McDonald's Corp., Mondelez International, MetLife Inc., Mohawk Industries, Mead Johnson, McCormick & Co., Martin Marietta Materials, Marsh & McLennan, 3M Company, Monster Beverage, The Mosaic Company, Marathon Petroleum, Merck & Co., Marathon Oil Corp., M&T Bank Corp., Mettler Toledo, Murphy Oil, Mylan N.V., Navient, Noble Energy Inc, NASDAQ OMX Group, NextEra Energy, Newmont Mining Corp. (Hldg. Co.), Netflix Inc., Nielsen Holdings, National Oilwell Varco Inc., Norfolk Southern Corp., Northern Trust Corp., Nucor Corp., Newell Brands, Realty Income Corporation, ONEOK, Omnicom Group, O'Reilly Automotive, Occidental Petroleum, People's United Financial, PACCAR Inc., PG&E Corp., Priceline.com Inc, Public Serv. Enterprise Inc., PepsiCo Inc., Principal Financial Group, Procter & Gamble, Progressive Corp., Pulte Homes Inc., Philip Morris International, PNC Financial Services, Pentair Ltd., Pinnacle West Capital, PPG Industries, PPL Corp., Prudential Financial, Phillips 66, Quanta Services Inc., Praxair Inc., PayPal, Ryder System, Royal Caribbean Cruises Ltd, Regeneron, Robert Half International, Roper Industries, Range Resources Corp., Republic Services Inc, SCANA Corp, Charles Schwab Corporation, Spectra Energy Corp., Sealed Air, SL Green Realty, Scripps Networks Interactive Inc., Southern Co., Simon Property Group Inc, Stericycle Inc, Sempra Energy, SunTrust Banks, State Street Corp., Skyworks Solutions, Synchrony Financial, Stryker Corp., Molson Coors Brewing Company, Teradata Corp., Tegna, Inc., Torchmark Corp., Thermo Fisher Scientific, TripAdvisor, The Travelers Companies Inc., Tractor Supply Company, Tyson Foods, Tesoro Petroleum Co., Total System Services, Texas Instruments, Under Armour, United Continental Holdings, UDR Inc, Universal Health Services, Inc., United Health Group Inc., Unum Group, Union Pacific, United Technologies, Varian Medical Systems, Valero Energy, Vulcan Materials, Vornado Realty Trust, Verisk Analytics, Verisign Inc., Vertex Pharmaceuticals Inc, Ventas Inc, Waters Corporation, Wec Energy Group Inc, Whirlpool Corp., Waste Management Inc., Williams Cos., Western Union Co, Weyerhaeuser Corp., Wyndham Worldwide, Xcel Energy Inc, XL Capital, Dentsply Sirona, Xerox Corp., Xylem Inc., Yahoo Inc., Zimmer Biomet Holdings, Zions Bancorp, Zoetis
Cluster 2: Allegion, Apache Corporation, Chesapeake Energy, Charter Communications, Colgate-Palmolive, Kimberly-Clark, S&P Global, Inc.
Cluster 0: Anadarko Petroleum Corp, Bristol-Myers Squibb, Citigroup Inc., eBay Inc., General Motors, JPMorgan Chase & Co., Coca Cola Company, Pfizer Inc., AT&T Inc, Verizon Communications, Wells Fargo, Exxon Mobil Corp.
Cluster 4: Bank of America Corp, Intel Corp.
In [ ]:
# Create a crosstab for sectors and clusters
df_kmean_crosstab_with_totals = pd.crosstab(df_kmean['Cluster'], df_kmean['GICS Sector'], margins=True, margins_name="Total")

# Display the crosstab with totals
df_kmean_crosstab_with_totals.T  # Transpose to switch rows/columns if needed
Out[ ]:
Cluster 0 1 2 3 4 Total
GICS Sector
Consumer Discretionary 1 2 1 36 0 40
Consumer Staples 1 1 2 15 0 19
Energy 2 4 2 22 0 30
Financials 3 1 1 43 1 49
Health Care 2 2 0 36 0 40
Industrials 0 4 1 48 0 53
Information Technology 1 0 0 31 1 33
Materials 0 2 0 18 0 20
Real Estate 0 0 0 27 0 27
Telecommunications Services 2 0 0 3 0 5
Utilities 0 0 0 24 0 24
Total 12 16 7 303 2 340
In [ ]:
# Show means for our original columns by Cluster
#df_model_k_means_grouped = df_model_k_means.groupby('Cluster').mean()

# Loop through the selected columns and plot boxplots
for col in list_numcol:
    if col in df_kmean.columns:  # Ensure the column exists in the DataFrame
        plt.figure(figsize=(8, 5))
        sns.boxplot(data=df_kmean, x='Cluster', y=col, hue='Cluster', palette='viridis', legend='full')
        plt.xticks(rotation=90)
        plt.title(f'Boxplot of {col} by Cluster')
        plt.show()

OBSERVATION

It appears that a k value of 5 or 6 would work best for our dataset, as it has the highest silhouette score, and also emerged as a key value in the elbow exercise.

After using K-means to predict the clusters on our dataset, we can see the companies break out as follows into our five clusters:

Cluster 0 (12 Companies)

Anadarko Petroleum Corp, Bristol-Myers Squibb, Citigroup Inc., eBay Inc., General Motors, JPMorgan Chase & Co., Coca Cola Company, Pfizer Inc., AT&T Inc, Verizon Communications, Wells Fargo, Exxon Mobil Corp.

Cluster 1 (16 Companies)

American Airlines Group, AbbVie, Devon Energy Corp., Freeport-McMoran Cp & Gld, IDEXX Laboratories, Masco Corp., Moody's Corp, Altria Group Inc, Newfield Exploration Co, Pitney-Bowes, Sherwin-Williams, Southwestern Energy, United Parcel Service, Wynn Resorts Ltd, Cimarex Energy, Yum! Brands Inc

Cluster 2 (7 Companies)

Allegion, Apache Corporation, Chesapeake Energy, Charter Communications, Colgate-Palmolive, Kimberly-Clark, S&P Global, Inc.

Cluster 4 (2 Companies)

Bank of America Corp, Intel Corp.

Cluster 3 (All remaining 303 Companies)

Abbott Laboratories, Adobe Systems Inc, Analog Devices, Inc., Archer-Daniels-Midland Co, Alliance Data Systems, Ameren Corp, American Electric Power, AFLAC Inc, American International Group, Inc., Apartment Investment & Mgmt, Assurant Inc, Arthur J. Gallagher & Co., Akamai Technologies Inc, Albemarle Corp, Alaska Air Group Inc, Allstate Corp, Alexion Pharmaceuticals, Applied Materials Inc, AMETEK Inc, Affiliated Managers Group Inc, Amgen Inc, Ameriprise Financial, American Tower Corp A, Amazon.com Inc, AutoNation Inc, Anthem Inc., Aon plc, Amphenol Corp, Arconic Inc, Activision Blizzard, AvalonBay Communities, Inc., Broadcom, American Water Works Company Inc, American Express Co, Boeing Company, Baxter International Inc., BB&T Corporation, Bard (C.R.) Inc., Baker Hughes Inc, BIOGEN IDEC Inc., The Bank of New York Mellon Corp., Ball Corp, Boston Scientific, BorgWarner, Boston Properties, Caterpillar Inc., Chubb Limited, CBRE Group, Crown Castle International Corp., Carnival Corp., Celgene Corp., CF Industries Holdings Inc, Citizens Financial Group, Church & Dwight, C. H. Robinson Worldwide, CIGNA Corp., Cincinnati Financial, Comerica Inc., CME Group Inc., Chipotle Mexican Grill, Cummins Inc., CMS Energy, Centene Corporation, CenterPoint Energy, Capital One Financial, Cabot Oil & Gas, The Cooper Companies, CSX Corp., CenturyLink Inc, Cognizant Technology Solutions, Citrix Systems, CVS Health, Chevron Corp., Concho Resources, Dominion Resources, Delta Air Lines, Du Pont (E.I.), Deere & Co., Discover Financial Services, Quest Diagnostics, Danaher Corp., The Walt Disney Company, Discovery Communications-A, Discovery Communications-C, Delphi Automotive, Digital Realty Trust, Dun & Bradstreet, Dover Corp., Dr Pepper Snapple Group, Duke Energy, DaVita Inc., Ecolab Inc., Consolidated Edison, Equifax Inc., Edison Int'l, Eastman Chemical, EOG Resources, Equinix, Equity Residential, EQT Corporation, Eversource Energy, Essex Property Trust, Inc., E*Trade, Eaton Corporation, Entergy Corp., Edwards Lifesciences, Exelon Corp., Expeditors Int'l, Expedia Inc., Extra Space Storage, Ford Motor, Fastenal Co, Facebook, Fortune Brands Home & Security, FirstEnergy Corp, Fidelity National Information Services, Fiserv Inc, FLIR Systems, Fluor Corp., Flowserve Corporation, FMC Corporation, Federal Realty Investment Trust, First Solar Inc, Frontier Communications, General Dynamics, General Growth Properties Inc., Gilead Sciences, Corning Inc., Genuine Parts, Garmin Ltd., Goodyear Tire & Rubber, Grainger (W.W.) Inc., Halliburton Co., Hasbro Inc., Huntington Bancshares, HCA Holdings, Welltower Inc., HCP Inc., Hess Corporation, Hartford Financial Svc.Gp., Harley-Davidson, Honeywell Int'l Inc., Hewlett Packard Enterprise, HP Inc., Hormel Foods Corp., Henry Schein, Host Hotels & Resorts, The Hershey Company, Humana Inc., International Business Machines, Intl Flavors & Fragrances, International Paper, Interpublic Group, Iron Mountain Incorporated, Intuitive Surgical Inc., Illinois Tool Works, Invesco Ltd., J. B. Hunt Transport Services, Jacobs Engineering Group, Juniper Networks, Kimco Realty, Kinder Morgan, Kansas City Southern, Leggett & Platt, Lennar Corp., Laboratory Corp. of America Holding, LKQ Corporation, L-3 Communications Holdings, Lilly (Eli) & Co., Lockheed Martin Corp., Alliant Energy Corp, Leucadia National Corp., Southwest Airlines, Level 3 Communications, LyondellBasell, Mastercard Inc., Mid-America Apartments, Macerich, Marriott Int'l., Mattel Inc., McDonald's Corp., Mondelez International, MetLife Inc., Mohawk Industries, Mead Johnson, McCormick & Co., Martin Marietta Materials, Marsh & McLennan, 3M Company, Monster Beverage, The Mosaic Company, Marathon Petroleum, Merck & Co., Marathon Oil Corp., M&T Bank Corp., Mettler Toledo, Murphy Oil, Mylan N.V., Navient, Noble Energy Inc, NASDAQ OMX Group, NextEra Energy, Newmont Mining Corp. (Hldg. Co.), Netflix Inc., Nielsen Holdings, National Oilwell Varco Inc., Norfolk Southern Corp., Northern Trust Corp., Nucor Corp., Newell Brands, Realty Income Corporation, ONEOK, Omnicom Group, O'Reilly Automotive, Occidental Petroleum, People's United Financial, PACCAR Inc., PG&E Corp., Priceline.com Inc, Public Serv. Enterprise Inc., PepsiCo Inc., Principal Financial Group, Procter & Gamble, Progressive Corp., Pulte Homes Inc., Philip Morris International, PNC Financial Services, Pentair Ltd., Pinnacle West Capital, PPG Industries, PPL Corp., Prudential Financial, Phillips 66, Quanta Services Inc., Praxair Inc., PayPal, Ryder System, Royal Caribbean Cruises Ltd, Regeneron, Robert Half International, Roper Industries, Range Resources Corp., Republic Services Inc, SCANA Corp, Charles Schwab Corporation, Spectra Energy Corp., Sealed Air, SL Green Realty, Scripps Networks Interactive Inc., Southern Co., Simon Property Group Inc, Stericycle Inc, Sempra Energy, SunTrust Banks, State Street Corp., Skyworks Solutions, Synchrony Financial, Stryker Corp., Molson Coors Brewing Company, Teradata Corp., Tegna, Inc., Torchmark Corp., Thermo Fisher Scientific, TripAdvisor, The Travelers Companies Inc., Tractor Supply Company, Tyson Foods, Tesoro Petroleum Co., Total System Services, Texas Instruments, Under Armour, United Continental Holdings, UDR Inc, Universal Health Services, Inc., United Health Group Inc., Unum Group, Union Pacific, United Technologies, Varian Medical Systems, Valero Energy, Vulcan Materials, Vornado Realty Trust, Verisk Analytics, Verisign Inc., Vertex Pharmaceuticals Inc, Ventas Inc, Waters Corporation, Wec Energy Group Inc, Whirlpool Corp., Waste Management Inc., Williams Cos., Western Union Co, Weyerhaeuser Corp., Wyndham Worldwide, Xcel Energy Inc, XL Capital, Dentsply Sirona, Xerox Corp., Xylem Inc., Yahoo Inc., Zimmer Biomet Holdings, Zions Bancorp, Zoetis

Hierarchical Clustering¶

Next, we will try hierarchical clusterings.

The first thing we will do is figure out which distance metric and linkage type to use.

In [ ]:
df_hierarchical = df_model_scaled.copy()
In [ ]:
#Measure execution time
%%time

# Create list of distance metrics
distance_metrics = ['euclidean', 'chebyshev', 'mahalanobis', 'cityblock']

# Create list of linkage methods
linkage_methods = ['single', 'complete', 'average', 'centroid', 'ward', 'weighted']

high_cophenet_corr = 0
high_dm_lm = [0, 0]

# Calculate the cophentics for each combination of distance type and linkage method
for dm in distance_metrics:
    for lm in linkage_methods:
        try:
            # Try to compute linkage and cophenet correlation
            Z = linkage(df_hierarchical, metric=dm, method=lm)
            c, coph_dists = cophenet(Z, pdist(df_hierarchical))

            # Print cophenetic correlation result
            print(
                'Cophenetic correlation for {} distance and {} linkage is {}.'.format(
                    dm.capitalize(), lm, c
                )
            )

            # Update the highest cophenetic correlation
            if high_cophenet_corr < c:
                high_cophenet_corr = c
                high_dm_lm[0] = dm
                high_dm_lm[1] = lm

        except Exception as e:
            # Not all distance types can be used with all linkage types, so we need to gracefully handle errors
            print(f'Error for {dm.capitalize()} distance and {lm} linkage: {e}')
Cophenetic correlation for Euclidean distance and single linkage is 0.9515911691498813.
Cophenetic correlation for Euclidean distance and complete linkage is 0.8973300286572553.
Cophenetic correlation for Euclidean distance and average linkage is 0.9456134806319734.
Cophenetic correlation for Euclidean distance and centroid linkage is 0.9468040178955592.
Cophenetic correlation for Euclidean distance and ward linkage is 0.8184834889632843.
Cophenetic correlation for Euclidean distance and weighted linkage is 0.8824190968054058.
Cophenetic correlation for Chebyshev distance and single linkage is 0.944359706184292.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.8863927240886589.
Cophenetic correlation for Chebyshev distance and average linkage is 0.9381981458216698.
Error for Chebyshev distance and centroid linkage: `method=centroid` requires the distance metric to be Euclidean
Error for Chebyshev distance and ward linkage: `method=ward` requires the distance metric to be Euclidean
Cophenetic correlation for Chebyshev distance and weighted linkage is 0.8839528285551623.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.8688554364012083.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.49533463354436585.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.8380995772123909.
Error for Mahalanobis distance and centroid linkage: `method=centroid` requires the distance metric to be Euclidean
Error for Mahalanobis distance and ward linkage: `method=ward` requires the distance metric to be Euclidean
Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.7167075784099872.
Cophenetic correlation for Cityblock distance and single linkage is 0.9209089428269329.
Cophenetic correlation for Cityblock distance and complete linkage is 0.8440926567175726.
Cophenetic correlation for Cityblock distance and average linkage is 0.9363186631034428.
Error for Cityblock distance and centroid linkage: `method=centroid` requires the distance metric to be Euclidean
Error for Cityblock distance and ward linkage: `method=ward` requires the distance metric to be Euclidean
Cophenetic correlation for Cityblock distance and weighted linkage is 0.8593697074578455.
CPU times: user 197 ms, sys: 102 ms, total: 299 ms
Wall time: 185 ms
In [ ]:
# Print the ideal combination
print(
    'Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.'.format(
        high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
    )
)
Highest cophenetic correlation is 0.9515911691498813, which is obtained with Euclidean distance and single linkage.

The above code tells us that the best combination is Euclidean distance with single linkage. However, we need to visual the dendograms to see if this recommendation is suitable or not.

Let's re-run linkages, but using only Euclidian distances this time.

In [ ]:
# View dendograms

# Create list of linkage methods
linkage_methods = ['single', 'complete', 'average', 'centroid', 'ward', 'weighted'] ## Complete the code to add linkages

# Create lists to save results
compare_cols = ['Linkage', 'Cophenetic Coefficient']

# Create subplot
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))

# Loop through thelinkage methods and plot dendrogram and calculate cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(df_hierarchical, metric='euclidean', method=method)

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f'Dendrogram ({method.capitalize()} Linkage)')

    coph_corr, coph_dist = cophenet(Z, pdist(df_hierarchical))
    axs[i].annotate(
        f'Cophenetic\nCorrelation\n{coph_corr:0.2f}',
        (0.80, 0.80),
        xycoords='axes fraction',
    )

Next, we need to fit a new model using our chosen distance and linkage approach.

  • Single, centroid and average have the highest cophenetic correlation values, but also exhibit overlaps:

  • Single (Cophenetic Correlation = .95)

  • Average (Cophenetic Correlation = .95)
  • Centroid (Cophenetic Correlation = .95)

Several of the dendograms have lower scores but appear well organized.

  • Complete (Cophenetic Correlation = .90)
  • Ward (Cophenetic Correlation = .82)
  • Weighted (Cophenetic Correlation = .88)

  • We will choose complete, as the dendogram is very clean, and the value is not much lower than the single linkage (.95). In addition, we'll choose a cluster count of 5 again.

In [ ]:
# Fit the model with the desired number of clusters and selected distance and linkage type
# Note that affinity is now called metric in the function below
df_model_hc_5 = AgglomerativeClustering(n_clusters=5, metric='euclidean', linkage='complete')
df_model_hc_5.fit(df_hierarchical)
Out[ ]:
AgglomerativeClustering(linkage='complete', n_clusters=5)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
AgglomerativeClustering(linkage='complete', n_clusters=5)
In [ ]:
# Apply the hierarchical clustering to a final model
df_hc = df_model.copy()

# Add labels
df_hierarchical['Cluster'] = df_model_hc_5.labels_
df_hc['Cluster'] = df_model_hc_5.labels_
Cluster Profile¶
In [ ]:
# Group and count the results
display(df_hc['Cluster'].value_counts())
count
Cluster
1 318
0 10
2 7
3 3
4 2

In [ ]:
# Show the securities in each cluster
for cluster in df_hc['Cluster'].unique():
    cluster_securities = df_hc[df_hc['Cluster'] == cluster]['Security']
    print(f'Cluster {cluster}: {", ".join(cluster_securities)}')
Cluster 1: American Airlines Group, AbbVie, Abbott Laboratories, Adobe Systems Inc, Analog Devices, Inc., Archer-Daniels-Midland Co, Alliance Data Systems, Ameren Corp, American Electric Power, AFLAC Inc, American International Group, Inc., Apartment Investment & Mgmt, Assurant Inc, Arthur J. Gallagher & Co., Akamai Technologies Inc, Albemarle Corp, Alaska Air Group Inc, Allstate Corp, Applied Materials Inc, AMETEK Inc, Affiliated Managers Group Inc, Amgen Inc, Ameriprise Financial, American Tower Corp A, AutoNation Inc, Anthem Inc., Aon plc, Anadarko Petroleum Corp, Amphenol Corp, Arconic Inc, Activision Blizzard, AvalonBay Communities, Inc., Broadcom, American Water Works Company Inc, American Express Co, Boeing Company, Baxter International Inc., BB&T Corporation, Bard (C.R.) Inc., Baker Hughes Inc, BIOGEN IDEC Inc., The Bank of New York Mellon Corp., Ball Corp, Bristol-Myers Squibb, Boston Scientific, BorgWarner, Boston Properties, Caterpillar Inc., Chubb Limited, CBRE Group, Crown Castle International Corp., Carnival Corp., Celgene Corp., CF Industries Holdings Inc, Citizens Financial Group, Church & Dwight, C. H. Robinson Worldwide, CIGNA Corp., Cincinnati Financial, Comerica Inc., CME Group Inc., Chipotle Mexican Grill, Cummins Inc., CMS Energy, Centene Corporation, CenterPoint Energy, Capital One Financial, Cabot Oil & Gas, The Cooper Companies, CSX Corp., CenturyLink Inc, Cognizant Technology Solutions, Citrix Systems, CVS Health, Chevron Corp., Concho Resources, Dominion Resources, Delta Air Lines, Du Pont (E.I.), Deere & Co., Discover Financial Services, Quest Diagnostics, Danaher Corp., The Walt Disney Company, Discovery Communications-A, Discovery Communications-C, Delphi Automotive, Digital Realty Trust, Dun & Bradstreet, Dover Corp., Dr Pepper Snapple Group, Duke Energy, DaVita Inc., Devon Energy Corp., eBay Inc., Ecolab Inc., Consolidated Edison, Equifax Inc., Edison Int'l, Eastman Chemical, EOG Resources, Equinix, Equity Residential, EQT Corporation, Eversource Energy, Essex Property Trust, Inc., E*Trade, Eaton Corporation, Entergy Corp., Edwards Lifesciences, Expeditors Int'l, Expedia Inc., Extra Space Storage, Fastenal Co, Facebook, Fortune Brands Home & Security, Freeport-McMoran Cp & Gld, FirstEnergy Corp, Fidelity National Information Services, Fiserv Inc, FLIR Systems, Fluor Corp., Flowserve Corporation, FMC Corporation, Federal Realty Investment Trust, First Solar Inc, Frontier Communications, General Dynamics, General Growth Properties Inc., Gilead Sciences, Corning Inc., General Motors, Genuine Parts, Garmin Ltd., Goodyear Tire & Rubber, Grainger (W.W.) Inc., Hasbro Inc., Huntington Bancshares, HCA Holdings, Welltower Inc., HCP Inc., Hess Corporation, Hartford Financial Svc.Gp., Harley-Davidson, Honeywell Int'l Inc., HP Inc., Hormel Foods Corp., Henry Schein, Host Hotels & Resorts, The Hershey Company, Humana Inc., International Business Machines, IDEXX Laboratories, Intl Flavors & Fragrances, International Paper, Interpublic Group, Iron Mountain Incorporated, Intuitive Surgical Inc., Illinois Tool Works, Invesco Ltd., J. B. Hunt Transport Services, Jacobs Engineering Group, Juniper Networks, Kimco Realty, Kinder Morgan, Coca Cola Company, Kansas City Southern, Leggett & Platt, Lennar Corp., Laboratory Corp. of America Holding, LKQ Corporation, L-3 Communications Holdings, Lilly (Eli) & Co., Lockheed Martin Corp., Alliant Energy Corp, Leucadia National Corp., Southwest Airlines, Level 3 Communications, LyondellBasell, Mastercard Inc., Mid-America Apartments, Macerich, Marriott Int'l., Masco Corp., Mattel Inc., Moody's Corp, Mondelez International, MetLife Inc., Mohawk Industries, Mead Johnson, McCormick & Co., Martin Marietta Materials, Marsh & McLennan, 3M Company, Monster Beverage, Altria Group Inc, The Mosaic Company, Marathon Petroleum, Merck & Co., Marathon Oil Corp., M&T Bank Corp., Mettler Toledo, Murphy Oil, Mylan N.V., Navient, Noble Energy Inc, NASDAQ OMX Group, NextEra Energy, Newmont Mining Corp. (Hldg. Co.), Newfield Exploration Co, Nielsen Holdings, National Oilwell Varco Inc., Norfolk Southern Corp., Nucor Corp., Newell Brands, Realty Income Corporation, ONEOK, Omnicom Group, O'Reilly Automotive, Occidental Petroleum, People's United Financial, Pitney-Bowes, PACCAR Inc., PG&E Corp., Public Serv. Enterprise Inc., PepsiCo Inc., Pfizer Inc., Principal Financial Group, Procter & Gamble, Progressive Corp., Pulte Homes Inc., Philip Morris International, PNC Financial Services, Pentair Ltd., Pinnacle West Capital, PPG Industries, PPL Corp., Prudential Financial, Phillips 66, Quanta Services Inc., Praxair Inc., PayPal, Ryder System, Royal Caribbean Cruises Ltd, Regeneron, Robert Half International, Roper Industries, Range Resources Corp., Republic Services Inc, SCANA Corp, Charles Schwab Corporation, Spectra Energy Corp., Sealed Air, Sherwin-Williams, SL Green Realty, Scripps Networks Interactive Inc., Southern Co., Simon Property Group Inc, Stericycle Inc, Sempra Energy, SunTrust Banks, State Street Corp., Skyworks Solutions, Southwestern Energy, Synchrony Financial, Stryker Corp., AT&T Inc, Molson Coors Brewing Company, Teradata Corp., Tegna, Inc., Torchmark Corp., Thermo Fisher Scientific, TripAdvisor, The Travelers Companies Inc., Tractor Supply Company, Tyson Foods, Tesoro Petroleum Co., Total System Services, Texas Instruments, Under Armour, United Continental Holdings, UDR Inc, Universal Health Services, Inc., United Health Group Inc., Unum Group, Union Pacific, United Parcel Service, United Technologies, Varian Medical Systems, Valero Energy, Vulcan Materials, Vornado Realty Trust, Verisk Analytics, Verisign Inc., Vertex Pharmaceuticals Inc, Ventas Inc, Waters Corporation, Wec Energy Group Inc, Wells Fargo, Whirlpool Corp., Waste Management Inc., Williams Cos., Western Union Co, Weyerhaeuser Corp., Wyndham Worldwide, Wynn Resorts Ltd, Cimarex Energy, Xcel Energy Inc, XL Capital, Exxon Mobil Corp., Dentsply Sirona, Xerox Corp., Xylem Inc., Yahoo Inc., Yum! Brands Inc, Zimmer Biomet Holdings, Zions Bancorp, Zoetis
Cluster 2: Allegion, Apache Corporation, Chesapeake Energy, Charter Communications, Colgate-Palmolive, Kimberly-Clark, S&P Global, Inc.
Cluster 0: Alexion Pharmaceuticals, Amazon.com Inc, Exelon Corp., Ford Motor, Halliburton Co., Hewlett Packard Enterprise, McDonald's Corp., Netflix Inc., Northern Trust Corp., Priceline.com Inc
Cluster 4: Bank of America Corp, Intel Corp.
Cluster 3: Citigroup Inc., JPMorgan Chase & Co., Verizon Communications
In [ ]:
# Loop through the selected columns and plot boxplots
for col in list_numcol:
    if col in df_hc.columns:  # Ensure the column exists in the DataFrame
        plt.figure(figsize=(8, 5))
        sns.boxplot(data=df_hc, x='Cluster', y=col, hue='Cluster', palette='viridis', legend='full')
        plt.xticks(rotation=90)
        plt.title(f"Boxplot of {col} by Cluster")
        plt.show()

OBSERVATIONS

Using five clusters in hierarchical clustering assigned companies as follows:

Cluster 0 (10 companies)

Alexion Pharmaceuticals, Amazon.com Inc, Exelon Corp., Ford Motor, Halliburton Co., Hewlett Packard Enterprise, McDonald's Corp., Netflix Inc., Northern Trust Corp., Priceline.com Inc

Cluster 2 (7 companies)

Allegion, Apache Corporation, Chesapeake Energy, Charter Communications, Colgate-Palmolive, Kimberly-Clark, S&P Global, Inc.

Cluster 3 (3 companies)

Citigroup Inc., JPMorgan Chase & Co., Verizon Communications

Cluster 4 (2 companies)

Bank of America Corp, Intel Corp.

Cluster 1 (All other 318 companies)

American Airlines Group, AbbVie, Abbott Laboratories, Adobe Systems Inc, Analog Devices, Inc., Archer-Daniels-Midland Co, Alliance Data Systems, Ameren Corp, American Electric Power, AFLAC Inc, American International Group, Inc., Apartment Investment & Mgmt, Assurant Inc, Arthur J. Gallagher & Co., Akamai Technologies Inc, Albemarle Corp, Alaska Air Group Inc, Allstate Corp, Applied Materials Inc, AMETEK Inc, Affiliated Managers Group Inc, Amgen Inc, Ameriprise Financial, American Tower Corp A, AutoNation Inc, Anthem Inc., Aon plc, Anadarko Petroleum Corp, Amphenol Corp, Arconic Inc, Activision Blizzard, AvalonBay Communities, Inc., Broadcom, American Water Works Company Inc, American Express Co, Boeing Company, Baxter International Inc., BB&T Corporation, Bard (C.R.) Inc., Baker Hughes Inc, BIOGEN IDEC Inc., The Bank of New York Mellon Corp., Ball Corp, Bristol-Myers Squibb, Boston Scientific, BorgWarner, Boston Properties, Caterpillar Inc., Chubb Limited, CBRE Group, Crown Castle International Corp., Carnival Corp., Celgene Corp., CF Industries Holdings Inc, Citizens Financial Group, Church & Dwight, C. H. Robinson Worldwide, CIGNA Corp., Cincinnati Financial, Comerica Inc., CME Group Inc., Chipotle Mexican Grill, Cummins Inc., CMS Energy, Centene Corporation, CenterPoint Energy, Capital One Financial, Cabot Oil & Gas, The Cooper Companies, CSX Corp., CenturyLink Inc, Cognizant Technology Solutions, Citrix Systems, CVS Health, Chevron Corp., Concho Resources, Dominion Resources, Delta Air Lines, Du Pont (E.I.), Deere & Co., Discover Financial Services, Quest Diagnostics, Danaher Corp., The Walt Disney Company, Discovery Communications-A, Discovery Communications-C, Delphi Automotive, Digital Realty Trust, Dun & Bradstreet, Dover Corp., Dr Pepper Snapple Group, Duke Energy, DaVita Inc., Devon Energy Corp., eBay Inc., Ecolab Inc., Consolidated Edison, Equifax Inc., Edison Int'l, Eastman Chemical, EOG Resources, Equinix, Equity Residential, EQT Corporation, Eversource Energy, Essex Property Trust, Inc., E*Trade, Eaton Corporation, Entergy Corp., Edwards Lifesciences, Expeditors Int'l, Expedia Inc., Extra Space Storage, Fastenal Co, Facebook, Fortune Brands Home & Security, Freeport-McMoran Cp & Gld, FirstEnergy Corp, Fidelity National Information Services, Fiserv Inc, FLIR Systems, Fluor Corp., Flowserve Corporation, FMC Corporation, Federal Realty Investment Trust, First Solar Inc, Frontier Communications, General Dynamics, General Growth Properties Inc., Gilead Sciences, Corning Inc., General Motors, Genuine Parts, Garmin Ltd., Goodyear Tire & Rubber, Grainger (W.W.) Inc., Hasbro Inc., Huntington Bancshares, HCA Holdings, Welltower Inc., HCP Inc., Hess Corporation, Hartford Financial Svc.Gp., Harley-Davidson, Honeywell Int'l Inc., HP Inc., Hormel Foods Corp., Henry Schein, Host Hotels & Resorts, The Hershey Company, Humana Inc., International Business Machines, IDEXX Laboratories, Intl Flavors & Fragrances, International Paper, Interpublic Group, Iron Mountain Incorporated, Intuitive Surgical Inc., Illinois Tool Works, Invesco Ltd., J. B. Hunt Transport Services, Jacobs Engineering Group, Juniper Networks, Kimco Realty, Kinder Morgan, Coca Cola Company, Kansas City Southern, Leggett & Platt, Lennar Corp., Laboratory Corp. of America Holding, LKQ Corporation, L-3 Communications Holdings, Lilly (Eli) & Co., Lockheed Martin Corp., Alliant Energy Corp, Leucadia National Corp., Southwest Airlines, Level 3 Communications, LyondellBasell, Mastercard Inc., Mid-America Apartments, Macerich, Marriott Int'l., Masco Corp., Mattel Inc., Moody's Corp, Mondelez International, MetLife Inc., Mohawk Industries, Mead Johnson, McCormick & Co., Martin Marietta Materials, Marsh & McLennan, 3M Company, Monster Beverage, Altria Group Inc, The Mosaic Company, Marathon Petroleum, Merck & Co., Marathon Oil Corp., M&T Bank Corp., Mettler Toledo, Murphy Oil, Mylan N.V., Navient, Noble Energy Inc, NASDAQ OMX Group, NextEra Energy, Newmont Mining Corp. (Hldg. Co.), Newfield Exploration Co, Nielsen Holdings, National Oilwell Varco Inc., Norfolk Southern Corp., Nucor Corp., Newell Brands, Realty Income Corporation, ONEOK, Omnicom Group, O'Reilly Automotive, Occidental Petroleum, People's United Financial, Pitney-Bowes, PACCAR Inc., PG&E Corp., Public Serv. Enterprise Inc., PepsiCo Inc., Pfizer Inc., Principal Financial Group, Procter & Gamble, Progressive Corp., Pulte Homes Inc., Philip Morris International, PNC Financial Services, Pentair Ltd., Pinnacle West Capital, PPG Industries, PPL Corp., Prudential Financial, Phillips 66, Quanta Services Inc., Praxair Inc., PayPal, Ryder System, Royal Caribbean Cruises Ltd, Regeneron, Robert Half International, Roper Industries, Range Resources Corp., Republic Services Inc, SCANA Corp, Charles Schwab Corporation, Spectra Energy Corp., Sealed Air, Sherwin-Williams, SL Green Realty, Scripps Networks Interactive Inc., Southern Co., Simon Property Group Inc, Stericycle Inc, Sempra Energy, SunTrust Banks, State Street Corp., Skyworks Solutions, Southwestern Energy, Synchrony Financial, Stryker Corp., AT&T Inc, Molson Coors Brewing Company, Teradata Corp., Tegna, Inc., Torchmark Corp., Thermo Fisher Scientific, TripAdvisor, The Travelers Companies Inc., Tractor Supply Company, Tyson Foods, Tesoro Petroleum Co., Total System Services, Texas Instruments, Under Armour, United Continental Holdings, UDR Inc, Universal Health Services, Inc., United Health Group Inc., Unum Group, Union Pacific, United Parcel Service, United Technologies, Varian Medical Systems, Valero Energy, Vulcan Materials, Vornado Realty Trust, Verisk Analytics, Verisign Inc., Vertex Pharmaceuticals Inc, Ventas Inc, Waters Corporation, Wec Energy Group Inc, Wells Fargo, Whirlpool Corp., Waste Management Inc., Williams Cos., Western Union Co, Weyerhaeuser Corp., Wyndham Worldwide, Wynn Resorts Ltd, Cimarex Energy, Xcel Energy Inc, XL Capital, Exxon Mobil Corp., Dentsply Sirona, Xerox Corp., Xylem Inc., Yahoo Inc., Yum! Brands Inc, Zimmer Biomet Holdings, Zions Bancorp, Zoetis

Principal Component Analysis and t-SNE¶

Let's see if we can use PCA and t-SNE to visualize our clusters, now that we have our cluster count. We will use both the K-means and Hierarchical methods and compare.

In [ ]:
# Copy datafrase
df_pca = df_model_scaled.copy()
Visualization: PCA with K-Means¶
In [ ]:
# Reduce to higher dimensions first with PCA, using n_components from 2 to 10, and kmeans results *cluster=5)
for i in np.arange(2, 11):

  # Initialize PCA model
  pca = PCA(n_components=i)  # or any number you want
  X_pca = pca.fit_transform(df_pca)

  # Then apply t-SNE
  tsne = TSNE(n_components=2, random_state=8)
  X_reduced = tsne.fit_transform(X_pca)

  # Perform clustering on the reduced data (using KMeans cluster count here)
  kmeans = KMeans(n_clusters=5, random_state=8)
  clusters = kmeans.fit_predict(X_reduced)

  # Put t-SNE results in a dataframe
  df_tsne = pd.DataFrame(data=X_reduced, columns=['Component 1', 'Component 2'])

  # Add the cluster labels to the dataframe
  df_tsne['Cluster'] = clusters

  # Show it in a scatterplot
  plt.figure(figsize=(10, 8))
  sns.scatterplot(data=df_tsne, x='Component 1', y='Component 2',  hue='Cluster', palette='viridis', legend='full')

  # Customize the plot
  plt.title(f't-SNE Visualization of K-Means Clusters: {i}')
  plt.xlabel('Component 1')
  plt.ylabel('Component 2')
  plt.legend(title='Cluster')
  plt.show()
Visualization: PCA with Hierarchical¶
In [ ]:
# Reduce to higher dimensions first with PCA, using n_components from 2 to 10
for i in np.arange(2, 11):

  # Initialize PCA model
  pca = PCA(n_components=i)
  X_pca = pca.fit_transform(df_pca)

  # Then apply t-SNE
  tsne = TSNE(n_components=2, random_state=8)
  X_reduced = tsne.fit_transform(X_pca)

  # Perform hierarchical clustering (AgglomerativeClustering) on the reduced data
  hierarchical = AgglomerativeClustering(n_clusters=5)  # Set number of clusters to 5
  clusters = hierarchical.fit_predict(X_reduced)

  # Put t-SNE results in a dataframe
  df_tsne = pd.DataFrame(data=X_reduced, columns=['Component 1', 'Component 2'])

  # Add the cluster labels to the dataframe
  df_tsne['Cluster'] = clusters

  # Show it in a scatterplot
  plt.figure(figsize=(10, 8))
  sns.scatterplot(data=df_tsne, x='Component 1', y='Component 2',  hue='Cluster', palette='viridis', legend='full')

  # Customize the plot
  plt.title(f't-SNE Visualization of Hierarchical Clustering (n_clusters=5): {i}')
  plt.xlabel('Component 1')
  plt.ylabel('Component 2')
  plt.legend(title='Cluster')
  plt.show()

K-means vs Hierarchical Clustering¶

Next, let's determine which approach to take: K-means or Hierarchical.

  • Execution time
  • Cluster Distinction
  • Ideal number of clusters
  • Comparision of cluster profile results

Execution Time

Hierarchical runs more than twice as fast as K-means:

  • K-means CPU times

    • user 663 ms
    • sys: 54.8 ms
    • total: 718 ms
    • wall time: 827 ms
  • Hierarchical CPU times

    • user: 267 ms
    • sys: 73.4 ms
    • total: 341 ms
    • wall time: 295 ms

Ideal Number of Clusters

  • In both cases, 5 is the ideal cluster count.

Cluster Distinction

  • Visualizing the clusters in each method for K-means, the elbow method reveals a very distinct elbow between a count of 5 and 6. For clusters 6 and higher, the gained distortion reduction tapers off.

  • For Hierarchical, our Dendogram for complete linkage is well-organized for either 5 or 6 clusters.

  • Given both inputs, we will choose 5 clusters, resulting in the following data point assignents.

    • The overall cluster shapes are very similar in shape but nearly touching one another..
    • With K-means, we have a more distinguished middle centroid. In Hierarchical clustering, these same points are allocated to neighboring clusters.
    • The outliers in the middle left section were assigned to different clusters.

project_7_cluster5_compare.png

Cluster Profile Result Comparision

K-means performed best with five clusters, allocated as follows:

  • Cluster 3: 303
  • Cluster 1: 16
  • Cluster 0: 12
  • Cluster 2: 7
  • Cluster 4: 2

Hierarchical also has five clusters, with a similar allocation. In this algorithm, more data points were assigned to Cluster 1, while other clusters saw reduced counts.

  • Cluster 1: 318
  • Cluster 0: 10
  • Cluster 2: 7
  • Cluster 3: 3
  • Cluster 4: 2

Next, let's aggregate the data by cluster so that we can compare various features per cluster.

In [ ]:
# Create cluster-specific dataframes
df_kmean_cluster_0 = df_kmean[df_kmean['Cluster'] == 0]
df_kmean_cluster_1 = df_kmean[df_kmean['Cluster'] == 1]
df_kmean_cluster_2 = df_kmean[df_kmean['Cluster'] == 2]
df_kmean_cluster_3 = df_kmean[df_kmean['Cluster'] == 3]
df_kmean_cluster_4 = df_kmean[df_kmean['Cluster'] == 4]

list_clusters = [0,1,2,3,4]
In [ ]:
# View details for Cluster means
for col in num_cols:
  print(f'\n****** Feature {col} ******')
  for cluster in list_clusters:
    cluster_data = df_kmean[df_kmean['Cluster'] == cluster]
    print(f'Cluster {cluster} mean for {col}: {cluster_data[col].mean()}')
****** Feature Current Price ******
Cluster 0 mean for Current Price: 48.73499975000001
Cluster 1 mean for Current Price: 64.210386025625
Cluster 2 mean for Current Price: 84.35571614285713
Cluster 3 mean for Current Price: 83.29782519843894
Cluster 4 mean for Current Price: 25.6400005

****** Feature Price Change ******
Cluster 0 mean for Price Change: 5.318747360916667
Cluster 1 mean for Price Change: -2.1037755721874998
Cluster 2 mean for Price Change: 3.854980541142857
Cluster 3 mean for Price Change: 4.313401507676567
Cluster 4 mean for Price Change: 11.2379082245

****** Feature Volatility ******
Cluster 0 mean for Volatility: 1.2708920399166668
Cluster 1 mean for Volatility: 2.11978607225
Cluster 2 mean for Volatility: 1.827670343428571
Cluster 3 mean for Volatility: 1.499096510122112
Cluster 4 mean for Volatility: 1.32235485

****** Feature ROE ******
Cluster 0 mean for ROE: 26.0
Cluster 1 mean for ROE: 178.5
Cluster 2 mean for ROE: 633.5714285714286
Cluster 3 mean for ROE: 19.257425742574256
Cluster 4 mean for ROE: 12.5

****** Feature Cash Ratio ******
Cluster 0 mean for Cash Ratio: 71.75
Cluster 1 mean for Cash Ratio: 64.125
Cluster 2 mean for Cash Ratio: 33.57142857142857
Cluster 3 mean for Cash Ratio: 70.7095709570957
Cluster 4 mean for Cash Ratio: 130.5

****** Feature Net Cash Flow ******
Cluster 0 mean for Net Cash Flow: -4070833333.3333335
Cluster 1 mean for Net Cash Flow: 6140875.0
Cluster 2 mean for Net Cash Flow: -568400000.0
Cluster 3 mean for Net Cash Flow: 125750287.12871288
Cluster 4 mean for Net Cash Flow: 16755500000.0

****** Feature Net Income ******
Cluster 0 mean for Net Income: 11045666666.666666
Cluster 1 mean for Net Income: -603718000.0
Cluster 2 mean for Net Income: -4968157142.857142
Cluster 3 mean for Net Income: 1295945059.4059405
Cluster 4 mean for Net Income: 13654000000.0

****** Feature Earnings Per Share ******
Cluster 0 mean for Earnings Per Share: 2.03
Cluster 1 mean for Earnings Per Share: -3.2075
Cluster 2 mean for Earnings Per Share: -10.84142857142857
Cluster 3 mean for Earnings Per Share: 3.433415841584158
Cluster 4 mean for Earnings Per Share: 3.295

****** Feature Estimated Shares Outstanding ******
Cluster 0 mean for Estimated Shares Outstanding: 3507085088.125
Cluster 1 mean for Estimated Shares Outstanding: 547905122.985625
Cluster 2 mean for Estimated Shares Outstanding: 398169036.44285715
Cluster 3 mean for Estimated Shares Outstanding: 452037260.1709571
Cluster 4 mean for Estimated Shares Outstanding: 2791829362.1

****** Feature P/E Ratio ******
Cluster 0 mean for P/E Ratio: 26.986866739666667
Cluster 1 mean for P/E Ratio: 39.2437444469375
Cluster 2 mean for P/E Ratio: 42.28454148142857
Cluster 3 mean for P/E Ratio: 32.38692420680858
Cluster 4 mean for P/E Ratio: 13.649695574999999

****** Feature P/B Ratio ******
Cluster 0 mean for P/B Ratio: -6.238311734833332
Cluster 1 mean for P/B Ratio: 0.6717558973125
Cluster 2 mean for P/B Ratio: -11.589501539285711
Cluster 3 mean for P/B Ratio: -1.4586917297656765
Cluster 4 mean for P/B Ratio: 1.5084839159999999
In [ ]:
# Merge individual dataframes and write comparisions to new table

df_all_clusters = [df_kmean_cluster_0, df_kmean_cluster_1, df_kmean_cluster_2, df_kmean_cluster_3, df_kmean_cluster_4]

mean_table = pd.DataFrame()

for idx, df in enumerate(df_all_clusters):
    temp_mean_dict = {}

    for col in num_cols:
        if pd.api.types.is_numeric_dtype(df[col]):
            other_dfs = [df_all_clusters[i] for i in range(len(df_all_clusters)) if i != idx]  # Exclude the current dataframe
            col_means = [other_df[col].mean() for other_df in other_dfs]
            temp_mean_dict[col] = sum(col_means) / len(col_means)
        else:
            temp_mean_dict[col] = 'Non-numeric'

    # Add results
    temp_mean_df = pd.DataFrame([temp_mean_dict], index=[f'DF {idx}'])
    mean_table = pd.concat([mean_table, temp_mean_df])

# Set pandas option to display float values with 6 decimal places
pd.options.display.float_format = '{:,.2f}'.format

mean_table
#import ace_tools as tools; tools.display_dataframe_to_user(name="Mean Table from Other DataFrames", dataframe=mean_table)
Out[ ]:
Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
DF 0 64.38 4.33 1.69 210.96 74.73 4,079,747,790.53 2,344,517,479.14 -1.83 1,047,485,195.42 31.89 -2.72
DF 1 60.51 6.18 1.48 172.83 76.63 3,060,504,238.45 5,256,863,645.80 -0.52 1,787,280,186.71 28.83 -4.44
DF 2 55.47 4.69 1.55 59.06 84.27 3,204,139,457.20 6,347,973,431.52 1.39 1,824,714,208.35 28.07 -1.38
DF 3 55.74 4.58 1.64 212.64 74.99 3,030,601,885.42 4,781,947,880.95 -2.18 1,811,247,152.41 30.54 -3.91
DF 4 70.15 2.85 1.68 214.33 60.04 -1,126,835,542.80 1,692,434,145.80 -2.15 1,226,299,126.93 35.23 -4.65
In [ ]:
# Loop through the selected columns and plot boxplots in two columns

n_cols = 2

# Iterate through numberical columns
for i, col in enumerate(list_numcol):
    if col in df_kmean.columns:

        # Set up subplots
        fig, axs = plt.subplots(1, n_cols, figsize=(16, 5))

        # Create boxplot
        sns.boxplot(data=df_kmean, x='Cluster', y=col, ax=axs[0], hue='Cluster', palette='viridis')
        axs[0].set_title(f'Boxplot of {col} by Cluster')
        axs[0].set_xticklabels(axs[0].get_xticklabels(), rotation=90)

        if i + 1 < len(list_numcol):
            next_col = list_numcol[i + 1]
            if next_col in df_kmean.columns:
                sns.boxplot(data=df_kmean, x='Cluster', y=next_col, ax=axs[1], hue='Cluster', palette='viridis')
                axs[1].set_title(f'Boxplot of {next_col} by Cluster')
                axs[1].set_xticklabels(axs[1].get_xticklabels(), rotation=90)
        else:
            axs[1].axis('off')

        plt.tight_layout()
        plt.show()

        if i + 1 < len(list_numcol):
            i += 1
In [ ]:
# Show the means
df_means_by_cluster = df_kmean.groupby('Cluster')[list_numcol].mean()
df_means_by_cluster = df_means_by_cluster.T  # Transpose to get clusters as columns
df_means_by_cluster
Out[ ]:
Cluster 0 1 2 3 4
Current Price 48.73 64.21 84.36 83.30 25.64
Price Change 5.32 -2.10 3.85 4.31 11.24
Volatility 1.27 2.12 1.83 1.50 1.32
ROE 26.00 178.50 633.57 19.26 12.50
Cash Ratio 71.75 64.12 33.57 70.71 130.50
Net Cash Flow -4,070,833,333.33 6,140,875.00 -568,400,000.00 125,750,287.13 16,755,500,000.00
Net Income 11,045,666,666.67 -603,718,000.00 -4,968,157,142.86 1,295,945,059.41 13,654,000,000.00
Earnings Per Share 2.03 -3.21 -10.84 3.43 3.29
Estimated Shares Outstanding 3,507,085,088.12 547,905,122.99 398,169,036.44 452,037,260.17 2,791,829,362.10
P/E Ratio 26.99 39.24 42.28 32.39 13.65
P/B Ratio -6.24 0.67 -11.59 -1.46 1.51

Finally, let's display boxplots of the model and the means for each numerical metric, so we can compare clusters.

In [ ]:
# Plot boxplots with mean
for col in list_numcol:
    if col in df_kmean.columns:
        plt.figure(figsize=(8, 5))
        sns.boxplot(data=df_kmean, x='Cluster', y=col, hue='Cluster', palette='viridis', showmeans=False)

        # Use an overlaid  pointplot to show means
        sns.pointplot(data=df_kmean, x='Cluster', y=col, hue='Cluster', dodge=True, markers='D',
                      scale=0.75, ci=None, palette='dark', legend=False)

        plt.xticks(rotation=90)
        plt.title(f'Boxplot of {col} by Cluster')
        plt.show()

Final Model¶

Finally, let's determine which algorithm worked the best.

  • For both K-means and Hierarchical, the plot density decreases as the number of clusters (K-means) or components (Hierarchical) increases.

  • Since no extra definitive information surfaced during our PCA analysis, we will retain K-means and 5 clusters as our final model. This more evenly allocated the data points into different clusters, and since the dataset is small, we are not as concerned with increased execution time.

  • Some of the clusters have notable differences within the feature, as we can see in the boxplots. Other clusters show little difference, indicating that the feature may not be that important. Without normalizing the y-axis, it's difficult to compare them relative to one another.

  • Next, we will revisit the numerical results to determine the characteristics of various clusters.

In [ ]:
# Re-display final results from our k-means analysis
mean_table
Out[ ]:
Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
DF 0 64.38 4.33 1.69 210.96 74.73 4,079,747,790.53 2,344,517,479.14 -1.83 1,047,485,195.42 31.89 -2.72
DF 1 60.51 6.18 1.48 172.83 76.63 3,060,504,238.45 5,256,863,645.80 -0.52 1,787,280,186.71 28.83 -4.44
DF 2 55.47 4.69 1.55 59.06 84.27 3,204,139,457.20 6,347,973,431.52 1.39 1,824,714,208.35 28.07 -1.38
DF 3 55.74 4.58 1.64 212.64 74.99 3,030,601,885.42 4,781,947,880.95 -2.18 1,811,247,152.41 30.54 -3.91
DF 4 70.15 2.85 1.68 214.33 60.04 -1,126,835,542.80 1,692,434,145.80 -2.15 1,226,299,126.93 35.23 -4.65

Actionable Insights and Recommendations¶

As the last step in our analysis, we can examine details of each cluster resulting from K-means with 5 clusters.

Cluster 0¶

This cluster's data encapsulates large, major companies with moderate growth potential and strong profitability, but possibly poor management of cash.

Metrics

  • Current Price: 48.73
  • Price Change: +5.32 (positive momentum)
  • Volatility: 1.27 (stable)
  • Return on Equity (ROE): 26% (good profitability and use of equity)
  • Cash Ratio: 71.75 (sufficient cash cushion)
  • Net Cash Flow: -4.07 billion (high value; possibly indiciating poor cash management despite being profitable)
  • Net Income: 11.05 billion (good earnings)
  • Earnings Per Share (EPS): $2.03 (good earnings)
  • P/E Ratio: 26.99 (high; may be poised for growth)
  • P/B Ratio: -6.24 (needs more research)

Companies

Anadarko Petroleum Corp, Bristol-Myers Squibb, Citigroup Inc., eBay Inc., General Motors, JPMorgan Chase & Co., Coca Cola Company, Pfizer Inc., AT&T Inc, Verizon Communications, Wells Fargo, Exxon Mobil Corp.

Sectors and Sub Industries

In [ ]:
# @title
# Show pivot of sectors and subsectors with counts
df_kmean_cluster_0_pivot = df_kmean_cluster_0.pivot_table(
    index=['GICS Sector', 'GICS Sub Industry'],
    values='Security',
    aggfunc={'Security': 'count'},
    fill_value=0
).reset_index()

# Rename columns
df_kmean_cluster_0_pivot.columns = ['GICS Sector', 'GICS Sub Industry', 'Company Count']

# Calculate grand total
grand_total_count = df_kmean_cluster_0_pivot['Company Count'].sum()
grand_total_df = pd.DataFrame({
    'GICS Sector': ['Grand Total'],
    'GICS Sub Industry': [''],
    'Company Count': [grand_total_count]
})

# Combine counts with grand total
combined = pd.concat([df_kmean_cluster_0_pivot, grand_total_df], ignore_index=True)

# Display the combined DataFrame
print('Cluster 0: Sector, Sub Industry and Company Count')
combined
Cluster 0: Sector, Sub Industry and Company Count
Out[ ]:
GICS Sector GICS Sub Industry Company Count
0 Consumer Discretionary Automobile Manufacturers 1
1 Consumer Staples Soft Drinks 1
2 Energy Integrated Oil & Gas 1
3 Energy Oil & Gas Exploration & Production 1
4 Financials Banks 3
5 Health Care Health Care Distributors 1
6 Health Care Pharmaceuticals 1
7 Information Technology Internet Software & Services 1
8 Telecommunications Services Integrated Telecommunications Services 2
9 Grand Total 12
Cluster 1¶

Companies in this cluster are large but have poor earnings performance and show high volatility. Seeing both a negative earnings and declining stock price could indicate underlying problems. Investing in companies in the cluster requires research to find the ones with potential for turnaround.

Metrics

  • Current Price: 64.21
  • Price Change: -2.10 (slight decline)
  • Volatility: 2.12 (shows uncertainty in market)
  • Return on Equity (ROE): 178.5% (very high; requires research)
  • Cash Ratio: 64.12 (good liquidity)
  • Net Cash Flow: 6.14 million (positive cash flow)
  • Net Income: -603.7 million (major losses)
  • Earnings Per Share (EPS): -3.21 (missed earnings)
  • P/E Ratio: 39.24 (very high - could be a good investment if recovery starts to surface)
  • P/B Ratio: 0.67 (undervalued)

Companies

American Airlines Group, AbbVie, Devon Energy Corp., Freeport-McMoran Cp & Gld, IDEXX Laboratories, Masco Corp., Moody's Corp, Altria Group Inc, Newfield Exploration Co, Pitney-Bowes, Sherwin-Williams, Southwestern Energy, United Parcel Service, Wynn Resorts Ltd, Cimarex Energy, Yum! Brands

Sectors and Sub Industries

In [ ]:
# @title
# Show pivot of sectors and subsectors with counts
df_kmean_cluster_1_pivot = df_kmean_cluster_1.pivot_table(
    index=['GICS Sector', 'GICS Sub Industry'],
    values='Security',
    aggfunc={'Security': 'count'},
    fill_value=0
).reset_index()

# Rename columns
df_kmean_cluster_1_pivot.columns = ['GICS Sector', 'GICS Sub Industry', 'Company Count']

# Calculate grand total
grand_total_count = df_kmean_cluster_1_pivot['Company Count'].sum()
grand_total_df = pd.DataFrame({
    'GICS Sector': ['Grand Total'],
    'GICS Sub Industry': [''],
    'Company Count': [grand_total_count]
})

# Combine counts with grand total
combined = pd.concat([df_kmean_cluster_1_pivot, grand_total_df], ignore_index=True)

# Display the combined DataFrame
print('Cluster 1: Sector, Sub Industry and Company Count')
combined
Cluster 1: Sector, Sub Industry and Company Count
Out[ ]:
GICS Sector GICS Sub Industry Company Count
0 Consumer Discretionary Casinos & Gaming 1
1 Consumer Discretionary Restaurants 1
2 Consumer Staples Tobacco 1
3 Energy Oil & Gas Exploration & Production 4
4 Financials Diversified Financial Services 1
5 Health Care Health Care Equipment 1
6 Health Care Pharmaceuticals 1
7 Industrials Air Freight & Logistics 1
8 Industrials Airlines 1
9 Industrials Building Products 1
10 Industrials Technology, Hardware, Software and Supplies 1
11 Materials Copper 1
12 Materials Specialty Chemicals 1
13 Grand Total 16
Cluster 2¶

This cluster consists of high-risk companies with significant losses and poor liquidity. Even though they show positive price movement, the company may have weak fundamentals. Investments should be limited and/or carefully researched for these companies.

Metrics

  • Current Price: 84.36
  • Price Change: +3.85 (positive momentum)
  • Volatility: 1.83 (moderate)
  • Return on Equity (ROE): 633.57% (exceptionally high)
  • Cash Ratio: 33.57 (low liquidity)
  • Net Cash Flow: -568.4 million (negative cash flow)
  • Net Income: -4.97 billion (significant losses)
  • Earnings Per Share (EPS): -10.84 (bad earnings)
  • P/E Ratio: 42.28 (may find growth, but risky)
  • P/B Ratio: -11.59 (not a good signal; could be liabilities)

Companies

Allegion, Apache Corporation, Chesapeake Energy, Charter Communications, Colgate-Palmolive, Kimberly-Clark, S&P Global, Inc.

Sectors and Sub Industries

In [ ]:
# @title
# Show pivot of sectors and subsectors with counts
df_kmean_cluster_2_pivot = df_kmean_cluster_2.pivot_table(
    index=['GICS Sector', 'GICS Sub Industry'],
    values='Security',
    aggfunc={'Security': 'count'},
    fill_value=0
).reset_index()

# Rename columns
df_kmean_cluster_2_pivot.columns = ['GICS Sector', 'GICS Sub Industry', 'Company Count']

# Calculate grand total
grand_total_count = df_kmean_cluster_2_pivot['Company Count'].sum()
grand_total_df = pd.DataFrame({
    'GICS Sector': ['Grand Total'],
    'GICS Sub Industry': [''],
    'Company Count': [grand_total_count]
})

# Combine counts with grand total
combined = pd.concat([df_kmean_cluster_2_pivot, grand_total_df], ignore_index=True)

# Display the combined DataFrame
print('Cluster 2: Sector, Sub Industry and Company Count')
combined
Cluster 2: Sector, Sub Industry and Company Count
Out[ ]:
GICS Sector GICS Sub Industry Company Count
0 Consumer Discretionary Cable & Satellite 1
1 Consumer Staples Household Products 2
2 Energy Integrated Oil & Gas 1
3 Energy Oil & Gas Exploration & Production 1
4 Financials Diversified Financial Services 1
5 Industrials Building Products 1
6 Grand Total 7
Cluster 3¶

This cluster is by far the largest one. Companies falling into this cluster have good liquidity, stable performance, and are profitable. Investment comes with moderate risk but potential for growth.

Metrics

  • Current Price: 83.30
  • Price Change: +4.31 (positive price momentum)
  • Volatility: 1.50 (stable)
  • Return on Equity (ROE): 19.26% (good)
  • Cash Ratio: 70.71 (good liquidity)
  • Net Cash Flow: 125.75 million (good cash flow management)
  • Net Income: 1.29 billion (profitable)
  • Earnings Per Share (EPS): 3.43 (strong earnings performance)
  • P/E Ratio: 32.39 (some risk, but also growth potential)
  • P/B Ratio: -1.46 (needs research)

Companies

Abbott Laboratories, Adobe Systems Inc, Analog Devices, Inc., Archer-Daniels-Midland Co, Alliance Data Systems, Ameren Corp, American Electric Power, AFLAC Inc, American International Group, Inc., Apartment Investment & Mgmt, Assurant Inc, Arthur J. Gallagher & Co., Akamai Technologies Inc, Albemarle Corp, Alaska Air Group Inc, Allstate Corp, Alexion Pharmaceuticals, Applied Materials Inc, AMETEK Inc, Affiliated Managers Group Inc, Amgen Inc, Ameriprise Financial, American Tower Corp A, Amazon.com Inc, AutoNation Inc, Anthem Inc., Aon plc, Amphenol Corp, Arconic Inc, Activision Blizzard, AvalonBay Communities, Inc., Broadcom, American Water Works Company Inc, American Express Co, Boeing Company, Baxter International Inc., BB&T Corporation, Bard (C.R.) Inc., Baker Hughes Inc, BIOGEN IDEC Inc., The Bank of New York Mellon Corp., Ball Corp, Boston Scientific, BorgWarner, Boston Properties, Caterpillar Inc., Chubb Limited, CBRE Group, Crown Castle International Corp., Carnival Corp., Celgene Corp., CF Industries Holdings Inc, Citizens Financial Group, Church & Dwight, C. H. Robinson Worldwide, CIGNA Corp., Cincinnati Financial, Comerica Inc., CME Group Inc., Chipotle Mexican Grill, Cummins Inc., CMS Energy, Centene Corporation, CenterPoint Energy, Capital One Financial, Cabot Oil & Gas, The Cooper Companies, CSX Corp., CenturyLink Inc, Cognizant Technology Solutions, Citrix Systems, CVS Health, Chevron Corp., Concho Resources, Dominion Resources, Delta Air Lines, Du Pont (E.I.), Deere & Co., Discover Financial Services, Quest Diagnostics, Danaher Corp., The Walt Disney Company, Discovery Communications-A, Discovery Communications-C, Delphi Automotive, Digital Realty Trust, Dun & Bradstreet, Dover Corp., Dr Pepper Snapple Group, Duke Energy, DaVita Inc., Ecolab Inc., Consolidated Edison, Equifax Inc., Edison Int'l, Eastman Chemical, EOG Resources, Equinix, Equity Residential, EQT Corporation, Eversource Energy, Essex Property Trust, Inc., E*Trade, Eaton Corporation, Entergy Corp., Edwards Lifesciences, Exelon Corp., Expeditors Int'l, Expedia Inc., Extra Space Storage, Ford Motor, Fastenal Co, Facebook, Fortune Brands Home & Security, FirstEnergy Corp, Fidelity National Information Services, Fiserv Inc, FLIR Systems, Fluor Corp., Flowserve Corporation, FMC Corporation, Federal Realty Investment Trust, First Solar Inc, Frontier Communications, General Dynamics, General Growth Properties Inc., Gilead Sciences, Corning Inc., Genuine Parts, Garmin Ltd., Goodyear Tire & Rubber, Grainger (W.W.) Inc., Halliburton Co., Hasbro Inc., Huntington Bancshares, HCA Holdings, Welltower Inc., HCP Inc., Hess Corporation, Hartford Financial Svc.Gp., Harley-Davidson, Honeywell Int'l Inc., Hewlett Packard Enterprise, HP Inc., Hormel Foods Corp., Henry Schein, Host Hotels & Resorts, The Hershey Company, Humana Inc., International Business Machines, Intl Flavors & Fragrances, International Paper, Interpublic Group, Iron Mountain Incorporated, Intuitive Surgical Inc., Illinois Tool Works, Invesco Ltd., J. B. Hunt Transport Services, Jacobs Engineering Group, Juniper Networks, Kimco Realty, Kinder Morgan, Kansas City Southern, Leggett & Platt, Lennar Corp., Laboratory Corp. of America Holding, LKQ Corporation, L-3 Communications Holdings, Lilly (Eli) & Co., Lockheed Martin Corp., Alliant Energy Corp, Leucadia National Corp., Southwest Airlines, Level 3 Communications, LyondellBasell, Mastercard Inc., Mid-America Apartments, Macerich, Marriott Int'l., Mattel Inc., McDonald's Corp., Mondelez International, MetLife Inc., Mohawk Industries, Mead Johnson, McCormick & Co., Martin Marietta Materials, Marsh & McLennan, 3M Company, Monster Beverage, The Mosaic Company, Marathon Petroleum, Merck & Co., Marathon Oil Corp., M&T Bank Corp., Mettler Toledo, Murphy Oil, Mylan N.V., Navient, Noble Energy Inc, NASDAQ OMX Group, NextEra Energy, Newmont Mining Corp. (Hldg. Co.), Netflix Inc., Nielsen Holdings, National Oilwell Varco Inc., Norfolk Southern Corp., Northern Trust Corp., Nucor Corp., Newell Brands, Realty Income Corporation, ONEOK, Omnicom Group, O'Reilly Automotive, Occidental Petroleum, People's United Financial, PACCAR Inc., PG&E Corp., Priceline.com Inc, Public Serv. Enterprise Inc., PepsiCo Inc., Principal Financial Group, Procter & Gamble, Progressive Corp., Pulte Homes Inc., Philip Morris International, PNC Financial Services, Pentair Ltd., Pinnacle West Capital, PPG Industries, PPL Corp., Prudential Financial, Phillips 66, Quanta Services Inc., Praxair Inc., PayPal, Ryder System, Royal Caribbean Cruises Ltd, Regeneron, Robert Half International, Roper Industries, Range Resources Corp., Republic Services Inc, SCANA Corp, Charles Schwab Corporation, Spectra Energy Corp., Sealed Air, SL Green Realty, Scripps Networks Interactive Inc., Southern Co., Simon Property Group Inc, Stericycle Inc, Sempra Energy, SunTrust Banks, State Street Corp., Skyworks Solutions, Synchrony Financial, Stryker Corp., Molson Coors Brewing Company, Teradata Corp., Tegna, Inc., Torchmark Corp., Thermo Fisher Scientific, TripAdvisor, The Travelers Companies Inc., Tractor Supply Company, Tyson Foods, Tesoro Petroleum Co., Total System Services, Texas Instruments, Under Armour, United Continental Holdings, UDR Inc, Universal Health Services, Inc., United Health Group Inc., Unum Group, Union Pacific, United Technologies, Varian Medical Systems, Valero Energy, Vulcan Materials, Vornado Realty Trust, Verisk Analytics, Verisign Inc., Vertex Pharmaceuticals Inc, Ventas Inc, Waters Corporation, Wec Energy Group Inc, Whirlpool Corp., Waste Management Inc., Williams Cos., Western Union Co, Weyerhaeuser Corp., Wyndham Worldwide, Xcel Energy Inc, XL Capital, Dentsply Sirona, Xerox Corp., Xylem Inc., Yahoo Inc., Zimmer Biomet Holdings, Zions Bancorp, Zoetis

Sectors and Sub Industries

In [ ]:
# @title
# Show all rows in the DataFrame
pd.set_option('display.max_rows', None)

# Show pivot of sectors and subsectors with counts
df_kmean_cluster_3_pivot = df_kmean_cluster_3.pivot_table(
    index=['GICS Sector', 'GICS Sub Industry'],
    values='Security',
    aggfunc={'Security': 'count'},
    fill_value=0
).reset_index()

# Rename columns
df_kmean_cluster_3_pivot.columns = ['GICS Sector', 'GICS Sub Industry', 'Company Count']

# Calculate grand total
grand_total_count = df_kmean_cluster_3_pivot['Company Count'].sum()
grand_total_df = pd.DataFrame({
    'GICS Sector': ['Grand Total'],
    'GICS Sub Industry': [''],
    'Company Count': [grand_total_count]
})

# Combine counts with grand total
combined = pd.concat([df_kmean_cluster_3_pivot, grand_total_df], ignore_index=True)

# Display the combined DataFrame
print('Cluster 3: Sector, Sub Industry and Company Count')
combined
Cluster 3: Sector, Sub Industry and Company Count
Out[ ]:
GICS Sector GICS Sub Industry Company Count
0 Consumer Discretionary Advertising 2
1 Consumer Discretionary Apparel, Accessories & Luxury Goods 1
2 Consumer Discretionary Auto Parts & Equipment 2
3 Consumer Discretionary Automobile Manufacturers 1
4 Consumer Discretionary Broadcasting & Cable TV 2
5 Consumer Discretionary Cable & Satellite 2
6 Consumer Discretionary Consumer Electronics 1
7 Consumer Discretionary Distributors 1
8 Consumer Discretionary Home Furnishings 1
9 Consumer Discretionary Homebuilding 2
10 Consumer Discretionary Hotels, Resorts & Cruise Lines 4
11 Consumer Discretionary Household Appliances 1
12 Consumer Discretionary Housewares & Specialties 1
13 Consumer Discretionary Internet & Direct Marketing Retail 4
14 Consumer Discretionary Leisure Products 2
15 Consumer Discretionary Motorcycle Manufacturers 1
16 Consumer Discretionary Publishing 1
17 Consumer Discretionary Restaurants 2
18 Consumer Discretionary Specialty Retail 1
19 Consumer Discretionary Specialty Stores 3
20 Consumer Discretionary Tires & Rubber 1
21 Consumer Staples Agricultural Products 1
22 Consumer Staples Brewers 1
23 Consumer Staples Drug Retail 1
24 Consumer Staples Household Products 1
25 Consumer Staples Packaged Foods & Meats 6
26 Consumer Staples Personal Products 1
27 Consumer Staples Soft Drinks 3
28 Consumer Staples Tobacco 1
29 Energy Integrated Oil & Gas 3
30 Energy Oil & Gas Equipment & Services 3
31 Energy Oil & Gas Exploration & Production 10
32 Energy Oil & Gas Refining & Marketing & Transportation 6
33 Financials Asset Management & Custody Banks 4
34 Financials Banks 6
35 Financials Consumer Finance 5
36 Financials Diversified Financial Services 5
37 Financials Financial Exchanges & Data 1
38 Financials Insurance Brokers 3
39 Financials Investment Banking & Brokerage 2
40 Financials Life & Health Insurance 3
41 Financials Multi-Sector Holdings 1
42 Financials Multi-line Insurance 1
43 Financials Property & Casualty Insurance 8
44 Financials Regional Banks 3
45 Financials Thrifts & Mortgage Finance 1
46 Health Care Biotechnology 7
47 Health Care Health Care Distributors 2
48 Health Care Health Care Equipment 10
49 Health Care Health Care Facilities 5
50 Health Care Health Care Supplies 2
51 Health Care Life Sciences Tools & Services 1
52 Health Care Managed Health Care 5
53 Health Care Pharmaceuticals 4
54 Industrials Aerospace & Defense 4
55 Industrials Air Freight & Logistics 2
56 Industrials Airlines 4
57 Industrials Building Products 2
58 Industrials Construction & Farm Machinery & Heavy Trucks 3
59 Industrials Diversified Commercial Services 1
60 Industrials Electrical Components & Equipment 1
61 Industrials Environmental Services 1
62 Industrials Human Resource & Employment Services 1
63 Industrials Industrial Conglomerates 14
64 Industrials Industrial Machinery 5
65 Industrials Industrial Materials 1
66 Industrials Railroads 4
67 Industrials Research & Consulting Services 4
68 Industrials Trucking 1
69 Information Technology Application Software 2
70 Information Technology Computer Hardware 1
71 Information Technology Data Processing & Outsourced Services 2
72 Information Technology Electronic Components 2
73 Information Technology Electronic Equipment & Instruments 1
74 Information Technology Home Entertainment Software 1
75 Information Technology IT Consulting & Other Services 3
76 Information Technology Internet Software & Services 11
77 Information Technology Networking Equipment 1
78 Information Technology Semiconductor Equipment 1
79 Information Technology Semiconductors 5
80 Information Technology Technology Hardware, Storage & Peripherals 1
81 Materials Construction Materials 2
82 Materials Diversified Chemicals 5
83 Materials Fertilizers & Agricultural Chemicals 2
84 Materials Gold 1
85 Materials Industrial Gases 1
86 Materials Metal & Glass Containers 1
87 Materials Paper Packaging 2
88 Materials Specialty Chemicals 3
89 Materials Steel 1
90 Real Estate Office REITs 1
91 Real Estate REITs 14
92 Real Estate Real Estate Services 1
93 Real Estate Residential REITs 4
94 Real Estate Retail REITs 4
95 Real Estate Specialized REITs 3
96 Telecommunications Services Alternative Carriers 1
97 Telecommunications Services Integrated Telecommunications Services 2
98 Utilities Electric Utilities 12
99 Utilities MultiUtilities 11
100 Utilities Water Utilities 1
101 Grand Total 303
Cluster 4¶

Only two companies fell this cluster, and both are likely good long-term investments. They appear to be in excellent financial health, have strong cash flow, and also solid earnings. Their relatively low P/E ratio suggest they are undervalued and could continue to grow.

Metrics

  • Current Price: 25.64 (lower end of the clusters)
  • Price Change: +11.24 (strong momentum)
  • Volatility: 1.32 (low/stable) *Return on Equity (ROE): 12.50% (healthy)
  • Cash Ratio: 130.50 (good liquidity)
  • Net Cash Flow: 16.76 billion (strong cash flow)
  • Net Income: 13.65 billion (strong earnings)
  • Earnings Per Share (EPS): 3.29 (strong earnings performance)
  • P/E Ratio: 13.65 (low value indicates undervaluation)
  • P/B Ratio: 1.51 (fair market value)

Companies

Bank of America Corp, Intel Corp.

Sectors and Sub Industries

In [ ]:
# @title
# Show pivot of sectors and subsectors with counts
df_kmean_cluster_4_pivot = df_kmean_cluster_4.pivot_table(
    index=['GICS Sector', 'GICS Sub Industry'],
    values='Security',
    aggfunc={'Security': 'count'},
    fill_value=0
).reset_index()

# Rename columns
df_kmean_cluster_4_pivot.columns = ['GICS Sector', 'GICS Sub Industry', 'Company Count']

# Calculate grand total
grand_total_count = df_kmean_cluster_4_pivot['Company Count'].sum()
grand_total_df = pd.DataFrame({
    'GICS Sector': ['Grand Total'],
    'GICS Sub Industry': [''],
    'Company Count': [grand_total_count]
})

# Combine counts with grand total
combined = pd.concat([df_kmean_cluster_4_pivot, grand_total_df], ignore_index=True)

# Display the combined DataFrame
print('Cluster 4: Sector, Sub Industry and Company Count')
combined
Cluster 4: Sector, Sub Industry and Company Count
Out[ ]:
GICS Sector GICS Sub Industry Company Count
0 Financials Banks 1
1 Information Technology Semiconductors 1
2 Grand Total 2

Summary¶

Short-Term Investments

For investors looking for short-term profits (weeks or months), the focus will be on price momentum and volatility, as well as liquidity to handle short-term financial commitments.

Good:

  • Cluster 4: Strong growth with low volatility and positive price momentum.
  • Cluster 3: Moderate growth with positive price change and solid cash standing with less risk.

Risky:

  • Cluster 0 has negative cash flow.
  • Clusters 1 and 2 show higher volatility and risk

Longer-Term Investments

For investors who are willing to invest and wait (a year or longer),

Good:

  • Cluster 4: Has strong financial health, excellent liquidity, and good growth. Low P/E ratio indicates an undervalued stock.
  • Cluster 3: Show both growth and positive cash flow, and low to medium volatility.

Risky:

  • Cluster 0: May work for long-term investment, but a negative cash flow doesn't bode well.
  • Clusters 1 and 2 both have poor financial health and weak fundamentals.

Export Notebook¶

In [ ]:
!pip install nbconvert
Requirement already satisfied: nbconvert in /usr/local/lib/python3.10/dist-packages (6.5.4)
Requirement already satisfied: lxml in /usr/local/lib/python3.10/dist-packages (from nbconvert) (4.9.4)
Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (4.12.3)
Requirement already satisfied: bleach in /usr/local/lib/python3.10/dist-packages (from nbconvert) (6.1.0)
Requirement already satisfied: defusedxml in /usr/local/lib/python3.10/dist-packages (from nbconvert) (0.7.1)
Requirement already satisfied: entrypoints>=0.2.2 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (0.4)
Requirement already satisfied: jinja2>=3.0 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (3.1.4)
Requirement already satisfied: jupyter-core>=4.7 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (5.7.2)
Requirement already satisfied: jupyterlab-pygments in /usr/local/lib/python3.10/dist-packages (from nbconvert) (0.3.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (2.1.5)
Requirement already satisfied: mistune<2,>=0.8.1 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (0.8.4)
Requirement already satisfied: nbclient>=0.5.0 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (0.10.0)
Requirement already satisfied: nbformat>=5.1 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (5.10.4)
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from nbconvert) (24.1)
Requirement already satisfied: pandocfilters>=1.4.1 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (1.5.1)
Requirement already satisfied: pygments>=2.4.1 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (2.18.0)
Requirement already satisfied: tinycss2 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (1.3.0)
Requirement already satisfied: traitlets>=5.0 in /usr/local/lib/python3.10/dist-packages (from nbconvert) (5.7.1)
Requirement already satisfied: platformdirs>=2.5 in /usr/local/lib/python3.10/dist-packages (from jupyter-core>=4.7->nbconvert) (4.3.6)
Requirement already satisfied: jupyter-client>=6.1.12 in /usr/local/lib/python3.10/dist-packages (from nbclient>=0.5.0->nbconvert) (6.1.12)
Requirement already satisfied: fastjsonschema>=2.15 in /usr/local/lib/python3.10/dist-packages (from nbformat>=5.1->nbconvert) (2.20.0)
Requirement already satisfied: jsonschema>=2.6 in /usr/local/lib/python3.10/dist-packages (from nbformat>=5.1->nbconvert) (4.23.0)
Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.10/dist-packages (from beautifulsoup4->nbconvert) (2.6)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.10/dist-packages (from bleach->nbconvert) (1.16.0)
Requirement already satisfied: webencodings in /usr/local/lib/python3.10/dist-packages (from bleach->nbconvert) (0.5.1)
Requirement already satisfied: attrs>=22.2.0 in /usr/local/lib/python3.10/dist-packages (from jsonschema>=2.6->nbformat>=5.1->nbconvert) (24.2.0)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /usr/local/lib/python3.10/dist-packages (from jsonschema>=2.6->nbformat>=5.1->nbconvert) (2023.12.1)
Requirement already satisfied: referencing>=0.28.4 in /usr/local/lib/python3.10/dist-packages (from jsonschema>=2.6->nbformat>=5.1->nbconvert) (0.35.1)
Requirement already satisfied: rpds-py>=0.7.1 in /usr/local/lib/python3.10/dist-packages (from jsonschema>=2.6->nbformat>=5.1->nbconvert) (0.20.0)
Requirement already satisfied: pyzmq>=13 in /usr/local/lib/python3.10/dist-packages (from jupyter-client>=6.1.12->nbclient>=0.5.0->nbconvert) (24.0.1)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.10/dist-packages (from jupyter-client>=6.1.12->nbclient>=0.5.0->nbconvert) (2.8.2)
Requirement already satisfied: tornado>=4.1 in /usr/local/lib/python3.10/dist-packages (from jupyter-client>=6.1.12->nbclient>=0.5.0->nbconvert) (6.3.3)
In [107]:
# Create an HTML version of this notebook - do this last
%%shell
jupyter nbconvert --to html '/content/drive/MyDrive/Learning/Data Coursework/PGP-DSBA/7-Unsupervised Learning/Project 7/USL_Project_LearnerNotebook_FullCode_Tuesday_Frase_vFinal.ipynb'
[NbConvertApp] Converting notebook /content/drive/MyDrive/Learning/Data Coursework/PGP-DSBA/7-Unsupervised Learning/Project 7/USL_Project_LearnerNotebook_FullCode_Tuesday_Frase_vFinal.ipynb to html
[NbConvertApp] Writing 11319755 bytes to /content/drive/MyDrive/Learning/Data Coursework/PGP-DSBA/7-Unsupervised Learning/Project 7/USL_Project_LearnerNotebook_FullCode_Tuesday_Frase_vFinal.html
Out[107]: